diff --git "a/LOG_NODE_RANK_3.log" "b/LOG_NODE_RANK_3.log" new file mode 100644--- /dev/null +++ "b/LOG_NODE_RANK_3.log" @@ -0,0 +1,12006 @@ +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +Warning: moe_ffn_hidden_size is not set, using ffn_hidden_size for MoE instead. +WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written. +WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it +[rank28]:[W909 16:56:47.583706334 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 28] using GPU 4 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +[rank25]:[W909 16:56:47.586315814 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 25] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +[rank29]:[W909 16:56:47.587659913 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 29] using GPU 5 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. 
You can pecify device_id in init_process_group() to force use of a particular device. +[rank26]:[W909 16:56:47.587720062 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 26] using GPU 2 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +[rank30]:[W909 16:56:47.587718862 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 30] using GPU 6 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +[rank27]:[W909 16:56:47.587767570 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 27] using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +[rank31]:[W909 16:56:47.623450315 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 31] using GPU 7 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +[rank24]:[W909 16:56:47.705531425 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 24] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. 
For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. 
+ warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/Ubiquant-Pretrain/megatron/core/transformer/transformer_config.py:779: UserWarning: If you are using transformer_engine as the transformer implementation, the core_attn is from transformer_engine and may be the fused version. For fused attention, you have no need to set 'core_attn' to recompute. Please check that the core_attn recompute is really needed. 
+ warnings.warn( +TransformerBlockSubmodules(layer_specs=[ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, 
submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': 
False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, 
params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), 
shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=) +TransformerBlockSubmodules(layer_specs=[ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, 
params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, 
submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, 
params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, 
mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=)TransformerBlockSubmodules(layer_specs=[ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, 
pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, 
pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, 
submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, 
params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, 
params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), 
shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, 
linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, 
submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, 
params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, 
submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, 
submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, 
params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=) +ear'>)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, 
cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=) + +ear'>)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, 
core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, 
submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, 
params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, 
params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=)ear'>)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, 
submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=) +nsformer_engine.TERowParallelLinear'>)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, 
submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': 
False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, 
sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), 
mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=, 
linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=)ss 'megatron.core.extensions.transformer_engine.TERowParallelLinear'>)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, 
submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={})), ModuleSpec(module=, params={}, submodules=TransformerLayerSubmodules(input_layernorm=, self_attention=ModuleSpec(module=, params={'attn_mask_type': }, submodules=SelfAttentionSubmodules(linear_qkv=, core_attention=, linear_proj=, q_layernorm=, k_layernorm=)), self_attn_bda=, pre_cross_attn_layernorm=, cross_attention=, cross_attn_bda=, pre_mlp_layernorm=, mlp=ModuleSpec(module=, params={}, submodules=MoESubmodules(experts=ModuleSpec(module=, params={}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)), shared_experts=ModuleSpec(module=, params={'gate': False}, submodules=MLPSubmodules(linear_fc1=, linear_fc2=)))), mlp_bda=, sharded_state_dict_keys_map={}))], layer_norm=) +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect. + warnings.warn( +/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect. 
+ warnings.warn(
+/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect.
+ warnings.warn(
+/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect.
+ warnings.warn(
+/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect.
+ warnings.warn(
+/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/.venv/lib/python3.11/site-packages/transformer_engine/pytorch/cpu_offload.py:595: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect.
+ warnings.warn(
+(min, max) time across ranks (ms):
+ load-checkpoint ................................: (76.07, 76.36)
+(min, max) time across ranks (ms):
+ model-and-optimizer-setup ......................: (1270.50, 1768.99)
+ train/valid/test-data-iterators-setup ..........: (748.35, 1133.72)
+ [2025-09-09 17:01:38] iteration 1/ 11920 | consumed samples: 1024 | elapsed time per iteration (ms): 146756.5 | throughput per GPU (TFLOP/s/GPU): 3.1 | MFU 0.31% | learning rate: 1.677726E-05 | global batch size: 1024 | lm loss: 1.146968E+01 | loss scale: 1.0 | grad norm: 17.497 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20 days, 5:53:10.456447 | finish at 2025-09-29 22:54:55
+(min, max) time across ranks (ms):
+ save-checkpoint ................................: (5157.31, 5157.73)
+ [2025-09-09 17:01:56] iteration 2/ 11920 | consumed samples: 2048 | elapsed time per iteration (ms): 12643.8 | throughput per GPU (TFLOP/s/GPU): 35.7 | MFU 3.61% | learning rate: 3.355452E-05 | global batch size: 1024 | lm loss: 1.146865E+01 | loss scale: 1.0 | grad norm: 17.439 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 17:51:28.598372 | finish at 2025-09-11 10:53:24
+ [2025-09-09 17:02:01] iteration 3/ 11920 | consumed samples: 3072 | elapsed time per iteration (ms): 5684.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 5.033178E-05 | global batch size: 1024 | lm loss: 1.128768E+01 | loss scale: 1.0 | grad norm: 20.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:49:05.305925 | finish at 2025-09-10 11:51:07
+ [2025-09-09 17:02:07] iteration 4/ 11920 | consumed samples: 4096 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 6.710904E-04 | global batch size: 1024 | lm loss: 1.109622E+01 | loss scale: 1.0 | grad norm: 5.613 | num zeros:
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:14.509747 | finish at 2025-09-10 11:38:21 + [2025-09-09 17:02:12] iteration 5/ 11920 | consumed samples: 5120 | elapsed time per iteration (ms): 5391.7 | throughput per GPU (TFLOP/s/GPU): 83.7 | MFU 8.47% | learning rate: 8.388630E-05 | global batch size: 1024 | lm loss: 1.096512E+01 | loss scale: 1.0 | grad norm: 3.397 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:41.833137 | finish at 2025-09-10 10:52:54 + [2025-09-09 17:02:18] iteration 6/ 11920 | consumed samples: 6144 | elapsed time per iteration (ms): 5653.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 1.006636E-04 | global batch size: 1024 | lm loss: 1.109552E+01 | loss scale: 1.0 | grad norm: 226.882 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:42:38.728529 | finish at 2025-09-10 11:44:57 + [2025-09-09 17:02:29] iteration 7/ 11920 | consumed samples: 7168 | elapsed time per iteration (ms): 11387.4 | throughput per GPU (TFLOP/s/GPU): 39.6 | MFU 4.01% | learning rate: 1.174408E-04 | global batch size: 1024 | lm loss: 1.084509E+01 | loss scale: 1.0 | grad norm: 3.210 | num zeros: 2360363.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 13:40:58.004261 | finish at 2025-09-11 06:43:27 + [2025-09-09 17:02:40] iteration 8/ 11920 | consumed samples: 8192 | elapsed time per iteration (ms): 10653.6 | throughput per GPU (TFLOP/s/GPU): 42.4 | MFU 4.29% | learning rate: 1.342181E-04 | global batch size: 1024 | lm loss: 1.072228E+01 | loss scale: 1.0 | grad norm: 2.906 | num zeros: 9443705.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 11:15:05.810734 | finish at 2025-09-11 04:17:46 + [2025-09-09 17:02:45] iteration 9/ 11920 | consumed samples: 9216 | elapsed time per iteration (ms): 5391.6 | throughput per 
GPU (TFLOP/s/GPU): 83.7 | MFU 8.47% | learning rate: 1.509953E-04 | global batch size: 1024 | lm loss: 1.058402E+01 | loss scale: 1.0 | grad norm: 2.861 | num zeros: 18885036.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:19.371891 | finish at 2025-09-10 10:53:05 + [2025-09-09 17:02:57] iteration 10/ 11920 | consumed samples: 10240 | elapsed time per iteration (ms): 11347.1 | throughput per GPU (TFLOP/s/GPU): 39.8 | MFU 4.02% | learning rate: 1.677726E-04 | global batch size: 1024 | lm loss: 1.042506E+01 | loss scale: 1.0 | grad norm: 2.834 | num zeros: 28325664.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 13:32:23.722708 | finish at 2025-09-11 06:35:20 + [2025-09-09 17:03:08] iteration 11/ 11920 | consumed samples: 11264 | elapsed time per iteration (ms): 11199.1 | throughput per GPU (TFLOP/s/GPU): 40.3 | MFU 4.08% | learning rate: 1.845498E-04 | global batch size: 1024 | lm loss: 1.024269E+01 | loss scale: 1.0 | grad norm: 2.807 | num zeros: 30705134.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 13:02:49.610781 | finish at 2025-09-11 06:05:58 + [2025-09-09 17:03:19] iteration 12/ 11920 | consumed samples: 12288 | elapsed time per iteration (ms): 11404.1 | throughput per GPU (TFLOP/s/GPU): 39.6 | MFU 4.00% | learning rate: 2.013271E-04 | global batch size: 1024 | lm loss: 1.002322E+01 | loss scale: 1.0 | grad norm: 2.838 | num zeros: 77892128.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 13:43:19.928411 | finish at 2025-09-11 06:46:39 + [2025-09-09 17:03:36] iteration 13/ 11920 | consumed samples: 13312 | elapsed time per iteration (ms): 16527.9 | throughput per GPU (TFLOP/s/GPU): 27.3 | MFU 2.76% | learning rate: 2.181044E-04 | global batch size: 1024 | lm loss: 9.792116E+00 | loss scale: 1.0 | grad norm: 2.786 | num zeros: 70816024.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 2 days, 6:39:57.943532 | finish at 2025-09-11 23:43:34 + [2025-09-09 17:03:42] iteration 14/ 11920 | consumed samples: 14336 | elapsed time per iteration (ms): 5835.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.348816E-04 | global batch size: 1024 | lm loss: 9.559721E+00 | loss scale: 1.0 | grad norm: 3.316 | num zeros: 80268744.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:18:01.749326 | finish at 2025-09-10 12:21:43 + [2025-09-09 17:03:52] iteration 15/ 11920 | consumed samples: 15360 | elapsed time per iteration (ms): 10681.5 | throughput per GPU (TFLOP/s/GPU): 42.3 | MFU 4.27% | learning rate: 2.516589E-04 | global batch size: 1024 | lm loss: 9.314860E+00 | loss scale: 1.0 | grad norm: 2.735 | num zeros: 103869520.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1 day, 11:19:23.222940 | finish at 2025-09-11 04:23:16 + [2025-09-09 17:03:58] iteration 16/ 11920 | consumed samples: 16384 | elapsed time per iteration (ms): 5424.4 | throughput per GPU (TFLOP/s/GPU): 83.2 | MFU 8.42% | learning rate: 2.684362E-04 | global batch size: 1024 | lm loss: 9.053099E+00 | loss scale: 1.0 | grad norm: 2.762 | num zeros: 108591152.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:12.473053 | finish at 2025-09-10 11:00:10 + [2025-09-09 17:04:03] iteration 17/ 11920 | consumed samples: 17408 | elapsed time per iteration (ms): 5070.2 | throughput per GPU (TFLOP/s/GPU): 89.0 | MFU 9.00% | learning rate: 2.852134E-04 | global batch size: 1024 | lm loss: 8.795851E+00 | loss scale: 1.0 | grad norm: 2.719 | num zeros: 115661160.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:50.783177 | finish at 2025-09-10 09:49:54 + [2025-09-09 17:04:08] iteration 18/ 11920 | consumed samples: 18432 | elapsed time per iteration (ms): 5299.2 | throughput per GPU (TFLOP/s/GPU): 85.2 | 
MFU 8.61% | learning rate: 3.019907E-04 | global batch size: 1024 | lm loss: 8.558558E+00 | loss scale: 1.0 | grad norm: 2.831 | num zeros: 113307840.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:11.578518 | finish at 2025-09-10 10:35:20 + [2025-09-09 17:04:13] iteration 19/ 11920 | consumed samples: 19456 | elapsed time per iteration (ms): 5225.2 | throughput per GPU (TFLOP/s/GPU): 86.4 | MFU 8.74% | learning rate: 3.187679E-04 | global batch size: 1024 | lm loss: 1.536169E+01 | loss scale: 1.0 | grad norm: 41.775 | num zeros: 37774392.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:25.680456 | finish at 2025-09-10 10:20:39 + [2025-09-09 17:04:18] iteration 20/ 11920 | consumed samples: 20480 | elapsed time per iteration (ms): 5026.2 | throughput per GPU (TFLOP/s/GPU): 89.8 | MFU 9.08% | learning rate: 3.355452E-04 | global batch size: 1024 | lm loss: 8.171030E+00 | loss scale: 1.0 | grad norm: 2.656 | num zeros: 132191136.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:36:51.423182 | finish at 2025-09-10 09:41:10 + [2025-09-09 17:04:24] iteration 21/ 11920 | consumed samples: 21504 | elapsed time per iteration (ms): 5099.8 | throughput per GPU (TFLOP/s/GPU): 88.5 | MFU 8.95% | learning rate: 3.523224E-04 | global batch size: 1024 | lm loss: 7.959463E+00 | loss scale: 1.0 | grad norm: 2.433 | num zeros: 136906464.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:22.788180 | finish at 2025-09-10 09:55:46 + [2025-09-09 17:04:29] iteration 22/ 11920 | consumed samples: 22528 | elapsed time per iteration (ms): 5105.6 | throughput per GPU (TFLOP/s/GPU): 88.4 | MFU 8.94% | learning rate: 3.690997E-04 | global batch size: 1024 | lm loss: 7.794059E+00 | loss scale: 1.0 | grad norm: 2.203 | num zeros: 103863384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:26.901104 | 
finish at 2025-09-10 09:56:56 + [2025-09-09 17:04:34] iteration 23/ 11920 | consumed samples: 23552 | elapsed time per iteration (ms): 5053.5 | throughput per GPU (TFLOP/s/GPU): 89.3 | MFU 9.03% | learning rate: 3.858769E-04 | global batch size: 1024 | lm loss: 7.623400E+00 | loss scale: 1.0 | grad norm: 2.076 | num zeros: 139263664.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:01.063281 | finish at 2025-09-10 09:46:35 + [2025-09-09 17:04:39] iteration 24/ 11920 | consumed samples: 24576 | elapsed time per iteration (ms): 5069.4 | throughput per GPU (TFLOP/s/GPU): 89.1 | MFU 9.01% | learning rate: 4.026542E-04 | global batch size: 1024 | lm loss: 7.503768E+00 | loss scale: 1.0 | grad norm: 2.019 | num zeros: 198273152.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:05.135134 | finish at 2025-09-10 09:49:44 + [2025-09-09 17:04:44] iteration 25/ 11920 | consumed samples: 25600 | elapsed time per iteration (ms): 5179.7 | throughput per GPU (TFLOP/s/GPU): 87.2 | MFU 8.81% | learning rate: 4.194315E-04 | global batch size: 1024 | lm loss: 7.368556E+00 | loss scale: 1.0 | grad norm: 2.423 | num zeros: 47211832.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:52.144589 | finish at 2025-09-10 10:11:36 + [2025-09-09 17:04:49] iteration 26/ 11920 | consumed samples: 26624 | elapsed time per iteration (ms): 5101.9 | throughput per GPU (TFLOP/s/GPU): 88.5 | MFU 8.95% | learning rate: 4.362087E-04 | global batch size: 1024 | lm loss: 7.291321E+00 | loss scale: 1.0 | grad norm: 0.755 | num zeros: 243125248.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:22.110392 | finish at 2025-09-10 09:56:11 + [2025-09-09 17:04:54] iteration 27/ 11920 | consumed samples: 27648 | elapsed time per iteration (ms): 5116.4 | throughput per GPU (TFLOP/s/GPU): 88.2 | MFU 8.92% | learning rate: 4.529860E-04 | global batch size: 1024 | 
lm loss: 7.247381E+00 | loss scale: 1.0 | grad norm: 0.396 | num zeros: 240766464.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:54:09.166604 | finish at 2025-09-10 09:59:03 + [2025-09-09 17:04:59] iteration 28/ 11920 | consumed samples: 28672 | elapsed time per iteration (ms): 5196.4 | throughput per GPU (TFLOP/s/GPU): 86.9 | MFU 8.79% | learning rate: 4.697632E-04 | global batch size: 1024 | lm loss: 7.349828E+00 | loss scale: 1.0 | grad norm: 4.043 | num zeros: 181760192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:56.007554 | finish at 2025-09-10 10:14:55 + [2025-09-09 17:05:05] iteration 29/ 11920 | consumed samples: 29696 | elapsed time per iteration (ms): 5221.1 | throughput per GPU (TFLOP/s/GPU): 86.5 | MFU 8.74% | learning rate: 4.865405E-04 | global batch size: 1024 | lm loss: 7.283216E+00 | loss scale: 1.0 | grad norm: 1.497 | num zeros: 250214144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:44.656860 | finish at 2025-09-10 10:19:49 + [2025-09-09 17:05:10] iteration 30/ 11920 | consumed samples: 30720 | elapsed time per iteration (ms): 5214.6 | throughput per GPU (TFLOP/s/GPU): 86.6 | MFU 8.75% | learning rate: 5.033177E-04 | global batch size: 1024 | lm loss: 7.314304E+00 | loss scale: 1.0 | grad norm: 1.699 | num zeros: 280878080.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:21.546834 | finish at 2025-09-10 10:18:31 + [2025-09-09 17:05:15] iteration 31/ 11920 | consumed samples: 31744 | elapsed time per iteration (ms): 5234.1 | throughput per GPU (TFLOP/s/GPU): 86.3 | MFU 8.72% | learning rate: 5.200951E-04 | global batch size: 1024 | lm loss: 7.316677E+00 | loss scale: 1.0 | grad norm: 1.654 | num zeros: 219512896.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:07.833464 | finish at 2025-09-10 10:22:23 + [2025-09-09 17:05:20] iteration 32/ 
11920 | consumed samples: 32768 | elapsed time per iteration (ms): 5161.6 | throughput per GPU (TFLOP/s/GPU): 87.5 | MFU 8.84% | learning rate: 5.368723E-04 | global batch size: 1024 | lm loss: 7.259080E+00 | loss scale: 1.0 | grad norm: 0.874 | num zeros: 252558096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:41.314716 | finish at 2025-09-10 10:08:02 + [2025-09-09 17:05:26] iteration 33/ 11920 | consumed samples: 33792 | elapsed time per iteration (ms): 5304.8 | throughput per GPU (TFLOP/s/GPU): 85.1 | MFU 8.61% | learning rate: 5.536496E-04 | global batch size: 1024 | lm loss: 7.500093E+00 | loss scale: 1.0 | grad norm: 2.320 | num zeros: 207708224.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:57.605353 | finish at 2025-09-10 10:36:23 + [2025-09-09 17:05:31] iteration 34/ 11920 | consumed samples: 34816 | elapsed time per iteration (ms): 5178.2 | throughput per GPU (TFLOP/s/GPU): 87.2 | MFU 8.82% | learning rate: 5.704268E-04 | global batch size: 1024 | lm loss: 7.182228E+00 | loss scale: 1.0 | grad norm: 1.112 | num zeros: 247840000.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:48.394166 | finish at 2025-09-10 10:11:19 + [2025-09-09 17:05:36] iteration 35/ 11920 | consumed samples: 35840 | elapsed time per iteration (ms): 5069.4 | throughput per GPU (TFLOP/s/GPU): 89.1 | MFU 9.01% | learning rate: 5.872041E-04 | global batch size: 1024 | lm loss: 7.204048E+00 | loss scale: 1.0 | grad norm: 1.368 | num zeros: 264363520.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:10.219395 | finish at 2025-09-10 09:49:46 + [2025-09-09 17:05:41] iteration 36/ 11920 | consumed samples: 36864 | elapsed time per iteration (ms): 5152.3 | throughput per GPU (TFLOP/s/GPU): 87.6 | MFU 8.86% | learning rate: 6.039813E-04 | global batch size: 1024 | lm loss: 7.110963E+00 | loss scale: 1.0 | grad norm: 0.576 | num 
zeros: 287963072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:29.365112 | finish at 2025-09-10 10:06:10 + [2025-09-09 17:05:46] iteration 37/ 11920 | consumed samples: 37888 | elapsed time per iteration (ms): 5140.7 | throughput per GPU (TFLOP/s/GPU): 87.8 | MFU 8.88% | learning rate: 6.207586E-04 | global batch size: 1024 | lm loss: 7.152319E+00 | loss scale: 1.0 | grad norm: 1.581 | num zeros: 217149648.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:06.588002 | finish at 2025-09-10 10:03:53 + [2025-09-09 17:05:51] iteration 38/ 11920 | consumed samples: 38912 | elapsed time per iteration (ms): 5232.1 | throughput per GPU (TFLOP/s/GPU): 86.3 | MFU 8.73% | learning rate: 6.375358E-04 | global batch size: 1024 | lm loss: 7.126531E+00 | loss scale: 1.0 | grad norm: 1.463 | num zeros: 184105008.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:07.339225 | finish at 2025-09-10 10:21:59 + [2025-09-09 17:05:56] iteration 39/ 11920 | consumed samples: 39936 | elapsed time per iteration (ms): 5116.2 | throughput per GPU (TFLOP/s/GPU): 88.2 | MFU 8.92% | learning rate: 6.543131E-04 | global batch size: 1024 | lm loss: 7.067020E+00 | loss scale: 1.0 | grad norm: 0.993 | num zeros: 158147168.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:04.999651 | finish at 2025-09-10 09:59:01 + [2025-09-09 17:06:02] iteration 40/ 11920 | consumed samples: 40960 | elapsed time per iteration (ms): 5260.2 | throughput per GPU (TFLOP/s/GPU): 85.8 | MFU 8.68% | learning rate: 6.710904E-04 | global batch size: 1024 | lm loss: 7.014977E+00 | loss scale: 1.0 | grad norm: 0.773 | num zeros: 233674752.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:21:30.867548 | finish at 2025-09-10 10:27:33 + [2025-09-09 17:06:07] iteration 41/ 11920 | consumed samples: 41984 | elapsed time per iteration 
(ms): 5186.5 | throughput per GPU (TFLOP/s/GPU): 87.1 | MFU 8.80% | learning rate: 6.878676E-04 | global batch size: 1024 | lm loss: 7.031491E+00 | loss scale: 1.0 | grad norm: 0.994 | num zeros: 228960352.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:50.482508 | finish at 2025-09-10 10:12:57 + [2025-09-09 17:06:12] iteration 42/ 11920 | consumed samples: 43008 | elapsed time per iteration (ms): 5161.8 | throughput per GPU (TFLOP/s/GPU): 87.5 | MFU 8.84% | learning rate: 7.046449E-04 | global batch size: 1024 | lm loss: 6.954517E+00 | loss scale: 1.0 | grad norm: 0.832 | num zeros: 250193920.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:51.474160 | finish at 2025-09-10 10:08:03 + [2025-09-09 17:06:17] iteration 43/ 11920 | consumed samples: 44032 | elapsed time per iteration (ms): 5149.2 | throughput per GPU (TFLOP/s/GPU): 87.7 | MFU 8.87% | learning rate: 7.214221E-04 | global batch size: 1024 | lm loss: 6.937999E+00 | loss scale: 1.0 | grad norm: 2.948 | num zeros: 257276464.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:16.549577 | finish at 2025-09-10 10:05:34 + [2025-09-09 17:06:22] iteration 44/ 11920 | consumed samples: 45056 | elapsed time per iteration (ms): 5133.8 | throughput per GPU (TFLOP/s/GPU): 87.9 | MFU 8.89% | learning rate: 7.381994E-04 | global batch size: 1024 | lm loss: 7.025707E+00 | loss scale: 1.0 | grad norm: 2.077 | num zeros: 252557408.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:09.467850 | finish at 2025-09-10 10:02:32 + [2025-09-09 17:06:27] iteration 45/ 11920 | consumed samples: 46080 | elapsed time per iteration (ms): 5147.5 | throughput per GPU (TFLOP/s/GPU): 87.7 | MFU 8.87% | learning rate: 7.549766E-04 | global batch size: 1024 | lm loss: 6.956869E+00 | loss scale: 1.0 | grad norm: 2.124 | num zeros: 264362752.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 16:58:46.158088 | finish at 2025-09-10 10:05:14 + [2025-09-09 17:06:33] iteration 46/ 11920 | consumed samples: 47104 | elapsed time per iteration (ms): 5122.9 | throughput per GPU (TFLOP/s/GPU): 88.1 | MFU 8.91% | learning rate: 7.717539E-04 | global batch size: 1024 | lm loss: 6.929164E+00 | loss scale: 1.0 | grad norm: 1.075 | num zeros: 252556592.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:49.272244 | finish at 2025-09-10 10:00:22 + [2025-09-09 17:06:38] iteration 47/ 11920 | consumed samples: 48128 | elapsed time per iteration (ms): 5196.3 | throughput per GPU (TFLOP/s/GPU): 86.9 | MFU 8.79% | learning rate: 7.885311E-04 | global batch size: 1024 | lm loss: 6.863750E+00 | loss scale: 1.0 | grad norm: 0.805 | num zeros: 240752640.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:15.378687 | finish at 2025-09-10 10:14:53 + [2025-09-09 17:06:43] iteration 48/ 11920 | consumed samples: 49152 | elapsed time per iteration (ms): 5194.2 | throughput per GPU (TFLOP/s/GPU): 86.9 | MFU 8.79% | learning rate: 8.053084E-04 | global batch size: 1024 | lm loss: 6.861452E+00 | loss scale: 1.0 | grad norm: 1.150 | num zeros: 259635472.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:45.956116 | finish at 2025-09-10 10:14:29 + [2025-09-09 17:06:48] iteration 49/ 11920 | consumed samples: 50176 | elapsed time per iteration (ms): 5191.4 | throughput per GPU (TFLOP/s/GPU): 87.0 | MFU 8.79% | learning rate: 8.220857E-04 | global batch size: 1024 | lm loss: 6.925682E+00 | loss scale: 1.0 | grad norm: 2.929 | num zeros: 264357376.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:07.480772 | finish at 2025-09-10 10:13:56 + [2025-09-09 17:06:53] iteration 50/ 11920 | consumed samples: 51200 | elapsed time per iteration (ms): 5079.9 | throughput per GPU (TFLOP/s/GPU): 88.9 | MFU 8.99% 
| learning rate: 8.388630E-04 | global batch size: 1024 | lm loss: 7.015708E+00 | loss scale: 1.0 | grad norm: 2.697 | num zeros: 221870112.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:58.809516 | finish at 2025-09-10 09:51:52 + [2025-09-09 17:06:58] iteration 51/ 11920 | consumed samples: 52224 | elapsed time per iteration (ms): 5150.0 | throughput per GPU (TFLOP/s/GPU): 87.7 | MFU 8.86% | learning rate: 8.556402E-04 | global batch size: 1024 | lm loss: 7.188289E+00 | loss scale: 1.0 | grad norm: 4.379 | num zeros: 228952624.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:44.912514 | finish at 2025-09-10 10:05:43 + [2025-09-09 17:07:04] iteration 52/ 11920 | consumed samples: 53248 | elapsed time per iteration (ms): 5152.0 | throughput per GPU (TFLOP/s/GPU): 87.6 | MFU 8.86% | learning rate: 8.724175E-04 | global batch size: 1024 | lm loss: 7.104446E+00 | loss scale: 1.0 | grad norm: 4.592 | num zeros: 172306496.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:03.406285 | finish at 2025-09-10 10:06:07 + [2025-09-09 17:07:09] iteration 53/ 11920 | consumed samples: 54272 | elapsed time per iteration (ms): 5162.7 | throughput per GPU (TFLOP/s/GPU): 87.5 | MFU 8.84% | learning rate: 8.891947E-04 | global batch size: 1024 | lm loss: 7.010694E+00 | loss scale: 1.0 | grad norm: 1.516 | num zeros: 160502624.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:05.454596 | finish at 2025-09-10 10:08:14 + [2025-09-09 17:07:14] iteration 54/ 11920 | consumed samples: 55296 | elapsed time per iteration (ms): 5182.8 | throughput per GPU (TFLOP/s/GPU): 87.1 | MFU 8.81% | learning rate: 9.059720E-04 | global batch size: 1024 | lm loss: 6.981435E+00 | loss scale: 1.0 | grad norm: 1.035 | num zeros: 195912768.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:59.012088 | finish at 
2025-09-10 10:12:13 + [2025-09-09 17:07:19] iteration 55/ 11920 | consumed samples: 56320 | elapsed time per iteration (ms): 5264.2 | throughput per GPU (TFLOP/s/GPU): 85.8 | MFU 8.67% | learning rate: 9.227492E-04 | global batch size: 1024 | lm loss: 7.055523E+00 | loss scale: 1.0 | grad norm: 1.983 | num zeros: 202990624.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:59.220650 | finish at 2025-09-10 10:28:18 + [2025-09-09 17:07:24] iteration 56/ 11920 | consumed samples: 57344 | elapsed time per iteration (ms): 5184.9 | throughput per GPU (TFLOP/s/GPU): 87.1 | MFU 8.80% | learning rate: 9.395265E-04 | global batch size: 1024 | lm loss: 7.052476E+00 | loss scale: 1.0 | grad norm: 1.395 | num zeros: 205354880.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:13.603224 | finish at 2025-09-10 10:12:38 + [2025-09-09 17:07:29] iteration 57/ 11920 | consumed samples: 58368 | elapsed time per iteration (ms): 5160.8 | throughput per GPU (TFLOP/s/GPU): 87.5 | MFU 8.85% | learning rate: 9.563037E-04 | global batch size: 1024 | lm loss: 6.958776E+00 | loss scale: 1.0 | grad norm: 1.415 | num zeros: 228951904.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:22.700269 | finish at 2025-09-10 10:07:52 + [2025-09-09 17:07:35] iteration 58/ 11920 | consumed samples: 59392 | elapsed time per iteration (ms): 5178.7 | throughput per GPU (TFLOP/s/GPU): 87.2 | MFU 8.82% | learning rate: 9.730810E-04 | global batch size: 1024 | lm loss: 6.924489E+00 | loss scale: 1.0 | grad norm: 0.976 | num zeros: 236040000.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:49.931359 | finish at 2025-09-10 10:11:25 + [2025-09-09 17:07:40] iteration 59/ 11920 | consumed samples: 60416 | elapsed time per iteration (ms): 5226.3 | throughput per GPU (TFLOP/s/GPU): 86.4 | MFU 8.73% | learning rate: 9.898583E-04 | global batch size: 1024 | lm loss: 
6.822841E+00 | loss scale: 1.0 | grad norm: 0.625 | num zeros: 257274912.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:09.478004 | finish at 2025-09-10 10:20:49 + [2025-09-09 17:07:45] iteration 60/ 11920 | consumed samples: 61440 | elapsed time per iteration (ms): 5241.7 | throughput per GPU (TFLOP/s/GPU): 86.1 | MFU 8.71% | learning rate: 1.006635E-03 | global batch size: 1024 | lm loss: 6.819909E+00 | loss scale: 1.0 | grad norm: 0.721 | num zeros: 254915504.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:05.975895 | finish at 2025-09-10 10:23:51 + [2025-09-09 17:07:50] iteration 61/ 11920 | consumed samples: 62464 | elapsed time per iteration (ms): 5216.2 | throughput per GPU (TFLOP/s/GPU): 86.6 | MFU 8.75% | learning rate: 1.023413E-03 | global batch size: 1024 | lm loss: 6.809915E+00 | loss scale: 1.0 | grad norm: 1.047 | num zeros: 252559328.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:58.880387 | finish at 2025-09-10 10:18:49 + [2025-09-09 17:07:56] iteration 62/ 11920 | consumed samples: 63488 | elapsed time per iteration (ms): 5313.8 | throughput per GPU (TFLOP/s/GPU): 85.0 | MFU 8.59% | learning rate: 1.040190E-03 | global batch size: 1024 | lm loss: 6.752289E+00 | loss scale: 1.0 | grad norm: 0.678 | num zeros: 259636256.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:11.312953 | finish at 2025-09-10 10:38:07 + [2025-09-09 17:08:01] iteration 63/ 11920 | consumed samples: 64512 | elapsed time per iteration (ms): 5275.2 | throughput per GPU (TFLOP/s/GPU): 85.6 | MFU 8.65% | learning rate: 1.056967E-03 | global batch size: 1024 | lm loss: 6.723064E+00 | loss scale: 1.0 | grad norm: 0.470 | num zeros: 259639840.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:22:28.392810 | finish at 2025-09-10 10:30:29 + [2025-09-09 17:08:06] iteration 64/ 11920 | 
consumed samples: 65536 | elapsed time per iteration (ms): 5222.1 | throughput per GPU (TFLOP/s/GPU): 86.5 | MFU 8.74% | learning rate: 1.073745E-03 | global batch size: 1024 | lm loss: 6.738349E+00 | loss scale: 1.0 | grad norm: 0.630 | num zeros: 254914592.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:53.008656 | finish at 2025-09-10 10:19:59 + [2025-09-09 17:08:12] iteration 65/ 11920 | consumed samples: 66560 | elapsed time per iteration (ms): 5569.7 | throughput per GPU (TFLOP/s/GPU): 81.1 | MFU 8.20% | learning rate: 1.090522E-03 | global batch size: 1024 | lm loss: 6.679352E+00 | loss scale: 1.0 | grad norm: 0.722 | num zeros: 259635328.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:20:28.584374 | finish at 2025-09-10 11:28:40 + [2025-09-09 17:08:17] iteration 66/ 11920 | consumed samples: 67584 | elapsed time per iteration (ms): 5258.7 | throughput per GPU (TFLOP/s/GPU): 85.9 | MFU 8.68% | learning rate: 1.107299E-03 | global batch size: 1024 | lm loss: 6.652689E+00 | loss scale: 1.0 | grad norm: 0.410 | num zeros: 259636816.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:56.775506 | finish at 2025-09-10 10:27:14 + [2025-09-09 17:08:22] iteration 67/ 11920 | consumed samples: 68608 | elapsed time per iteration (ms): 5251.9 | throughput per GPU (TFLOP/s/GPU): 86.0 | MFU 8.69% | learning rate: 1.124076E-03 | global batch size: 1024 | lm loss: 6.651512E+00 | loss scale: 1.0 | grad norm: 0.738 | num zeros: 245474864.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:30.278478 | finish at 2025-09-10 10:25:53 + [2025-09-09 17:08:27] iteration 68/ 11920 | consumed samples: 69632 | elapsed time per iteration (ms): 5257.7 | throughput per GPU (TFLOP/s/GPU): 85.9 | MFU 8.68% | learning rate: 1.140854E-03 | global batch size: 1024 | lm loss: 6.617465E+00 | loss scale: 1.0 | grad norm: 0.540 | num zeros: 
250195520.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:33.994383 | finish at 2025-09-10 10:27:01 + [2025-09-09 17:08:33] iteration 69/ 11920 | consumed samples: 70656 | elapsed time per iteration (ms): 5223.8 | throughput per GPU (TFLOP/s/GPU): 86.4 | MFU 8.74% | learning rate: 1.157631E-03 | global batch size: 1024 | lm loss: 6.624822E+00 | loss scale: 1.0 | grad norm: 0.685 | num zeros: 228954912.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:47.227706 | finish at 2025-09-10 10:20:20 + [2025-09-09 17:08:38] iteration 70/ 11920 | consumed samples: 71680 | elapsed time per iteration (ms): 5237.9 | throughput per GPU (TFLOP/s/GPU): 86.2 | MFU 8.72% | learning rate: 1.174408E-03 | global batch size: 1024 | lm loss: 6.611450E+00 | loss scale: 1.0 | grad norm: 0.722 | num zeros: 276175168.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:28.730986 | finish at 2025-09-10 10:23:07 + [2025-09-09 17:08:43] iteration 71/ 11920 | consumed samples: 72704 | elapsed time per iteration (ms): 5251.6 | throughput per GPU (TFLOP/s/GPU): 86.0 | MFU 8.69% | learning rate: 1.191185E-03 | global batch size: 1024 | lm loss: 6.549622E+00 | loss scale: 1.0 | grad norm: 0.465 | num zeros: 273802560.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:05.655017 | finish at 2025-09-10 10:25:49 + [2025-09-09 17:08:48] iteration 72/ 11920 | consumed samples: 73728 | elapsed time per iteration (ms): 5235.5 | throughput per GPU (TFLOP/s/GPU): 86.2 | MFU 8.72% | learning rate: 1.207963E-03 | global batch size: 1024 | lm loss: 6.600683E+00 | loss scale: 1.0 | grad norm: 0.738 | num zeros: 228951936.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:50.244699 | finish at 2025-09-10 10:22:39 + [2025-09-09 17:08:54] iteration 73/ 11920 | consumed samples: 74752 | elapsed time per iteration (ms): 
5214.5 | throughput per GPU (TFLOP/s/GPU): 86.6 | MFU 8.75% | learning rate: 1.224740E-03 | global batch size: 1024 | lm loss: 6.558777E+00 | loss scale: 1.0 | grad norm: 0.477 | num zeros: 243113824.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:35.861739 | finish at 2025-09-10 10:18:30 + [2025-09-09 17:08:59] iteration 74/ 11920 | consumed samples: 75776 | elapsed time per iteration (ms): 5195.7 | throughput per GPU (TFLOP/s/GPU): 86.9 | MFU 8.79% | learning rate: 1.241517E-03 | global batch size: 1024 | lm loss: 6.539164E+00 | loss scale: 1.0 | grad norm: 0.627 | num zeros: 276159136.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:47.713458 | finish at 2025-09-10 10:14:47 + [2025-09-09 17:09:04] iteration 75/ 11920 | consumed samples: 76800 | elapsed time per iteration (ms): 5214.0 | throughput per GPU (TFLOP/s/GPU): 86.6 | MFU 8.76% | learning rate: 1.258294E-03 | global batch size: 1024 | lm loss: 6.530567E+00 | loss scale: 1.0 | grad norm: 0.463 | num zeros: 269077888.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:20.369239 | finish at 2025-09-10 10:18:24 + [2025-09-09 17:09:09] iteration 76/ 11920 | consumed samples: 77824 | elapsed time per iteration (ms): 5258.2 | throughput per GPU (TFLOP/s/GPU): 85.9 | MFU 8.68% | learning rate: 1.275072E-03 | global batch size: 1024 | lm loss: 6.545404E+00 | loss scale: 1.0 | grad norm: 0.596 | num zeros: 228954352.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:58.543548 | finish at 2025-09-10 10:27:08 + [2025-09-09 17:09:15] iteration 77/ 11920 | consumed samples: 78848 | elapsed time per iteration (ms): 5266.6 | throughput per GPU (TFLOP/s/GPU): 85.7 | MFU 8.67% | learning rate: 1.291849E-03 | global batch size: 1024 | lm loss: 6.550747E+00 | loss scale: 1.0 | grad norm: 0.786 | num zeros: 247836160.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 17:19:31.811704 | finish at 2025-09-10 10:28:46 + [2025-09-09 17:09:20] iteration 78/ 11920 | consumed samples: 79872 | elapsed time per iteration (ms): 5257.9 | throughput per GPU (TFLOP/s/GPU): 85.9 | MFU 8.68% | learning rate: 1.308626E-03 | global batch size: 1024 | lm loss: 6.570714E+00 | loss scale: 1.0 | grad norm: 0.498 | num zeros: 290319456.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:44.176023 | finish at 2025-09-10 10:27:04 + [2025-09-09 17:09:25] iteration 79/ 11920 | consumed samples: 80896 | elapsed time per iteration (ms): 5284.7 | throughput per GPU (TFLOP/s/GPU): 85.4 | MFU 8.64% | learning rate: 1.325403E-03 | global batch size: 1024 | lm loss: 6.594672E+00 | loss scale: 1.0 | grad norm: 0.571 | num zeros: 269081184.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:22:56.354741 | finish at 2025-09-10 10:32:21 + [2025-09-09 17:09:30] iteration 80/ 11920 | consumed samples: 81920 | elapsed time per iteration (ms): 5274.7 | throughput per GPU (TFLOP/s/GPU): 85.6 | MFU 8.65% | learning rate: 1.342181E-03 | global batch size: 1024 | lm loss: 6.642711E+00 | loss scale: 1.0 | grad norm: 0.744 | num zeros: 262000928.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:52.898788 | finish at 2025-09-10 10:30:23 + [2025-09-09 17:09:36] iteration 81/ 11920 | consumed samples: 82944 | elapsed time per iteration (ms): 5286.6 | throughput per GPU (TFLOP/s/GPU): 85.4 | MFU 8.64% | learning rate: 1.358958E-03 | global batch size: 1024 | lm loss: 6.631994E+00 | loss scale: 1.0 | grad norm: 0.839 | num zeros: 193550368.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:08.544230 | finish at 2025-09-10 10:32:44 + [2025-09-09 17:09:41] iteration 82/ 11920 | consumed samples: 83968 | elapsed time per iteration (ms): 5290.7 | throughput per GPU (TFLOP/s/GPU): 85.3 | MFU 8.63% | 
learning rate: 1.375735E-03 | global batch size: 1024 | lm loss: 6.587546E+00 | loss scale: 1.0 | grad norm: 0.784 | num zeros: 179386784.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:51.153703 | finish at 2025-09-10 10:33:32 + [2025-09-09 17:09:46] iteration 83/ 11920 | consumed samples: 84992 | elapsed time per iteration (ms): 5292.3 | throughput per GPU (TFLOP/s/GPU): 85.3 | MFU 8.63% | learning rate: 1.392512E-03 | global batch size: 1024 | lm loss: 6.557182E+00 | loss scale: 1.0 | grad norm: 0.912 | num zeros: 191191360.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:04.407434 | finish at 2025-09-10 10:33:51 + [2025-09-09 17:09:52] iteration 84/ 11920 | consumed samples: 86016 | elapsed time per iteration (ms): 5296.9 | throughput per GPU (TFLOP/s/GPU): 85.2 | MFU 8.62% | learning rate: 1.409290E-03 | global batch size: 1024 | lm loss: 6.558598E+00 | loss scale: 1.0 | grad norm: 0.795 | num zeros: 179386848.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:54.295049 | finish at 2025-09-10 10:34:46 + [2025-09-09 17:09:57] iteration 85/ 11920 | consumed samples: 87040 | elapsed time per iteration (ms): 5301.5 | throughput per GPU (TFLOP/s/GPU): 85.2 | MFU 8.61% | learning rate: 1.426067E-03 | global batch size: 1024 | lm loss: 6.592675E+00 | loss scale: 1.0 | grad norm: 0.963 | num zeros: 198266960.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:25:43.479205 | finish at 2025-09-10 10:35:40 + [2025-09-09 17:10:02] iteration 86/ 11920 | consumed samples: 88064 | elapsed time per iteration (ms): 5321.4 | throughput per GPU (TFLOP/s/GPU): 84.8 | MFU 8.58% | learning rate: 1.442844E-03 | global batch size: 1024 | lm loss: 6.539623E+00 | loss scale: 1.0 | grad norm: 0.848 | num zeros: 195906688.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:33.331059 | finish at 
2025-09-10 10:39:36 + [2025-09-09 17:10:08] iteration 87/ 11920 | consumed samples: 89088 | elapsed time per iteration (ms): 5316.4 | throughput per GPU (TFLOP/s/GPU): 84.9 | MFU 8.59% | learning rate: 1.459621E-03 | global batch size: 1024 | lm loss: 6.511465E+00 | loss scale: 1.0 | grad norm: 0.656 | num zeros: 212430560.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:28.547087 | finish at 2025-09-10 10:38:36 + [2025-09-09 17:10:13] iteration 88/ 11920 | consumed samples: 90112 | elapsed time per iteration (ms): 5326.7 | throughput per GPU (TFLOP/s/GPU): 84.8 | MFU 8.57% | learning rate: 1.476399E-03 | global batch size: 1024 | lm loss: 6.475482E+00 | loss scale: 1.0 | grad norm: 0.511 | num zeros: 231313008.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:25.404053 | finish at 2025-09-10 10:40:38 + [2025-09-09 17:10:18] iteration 89/ 11920 | consumed samples: 91136 | elapsed time per iteration (ms): 5319.0 | throughput per GPU (TFLOP/s/GPU): 84.9 | MFU 8.58% | learning rate: 1.493176E-03 | global batch size: 1024 | lm loss: 6.448763E+00 | loss scale: 1.0 | grad norm: 0.464 | num zeros: 259635200.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:49.467046 | finish at 2025-09-10 10:39:08 + [2025-09-09 17:10:24] iteration 90/ 11920 | consumed samples: 92160 | elapsed time per iteration (ms): 5349.1 | throughput per GPU (TFLOP/s/GPU): 84.4 | MFU 8.53% | learning rate: 1.509953E-03 | global batch size: 1024 | lm loss: 6.446258E+00 | loss scale: 1.0 | grad norm: 0.469 | num zeros: 245476384.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:39.346647 | finish at 2025-09-10 10:45:03 + [2025-09-09 17:10:29] iteration 91/ 11920 | consumed samples: 93184 | elapsed time per iteration (ms): 5299.8 | throughput per GPU (TFLOP/s/GPU): 85.2 | MFU 8.61% | learning rate: 1.526731E-03 | global batch size: 1024 | lm loss: 
6.433271E+00 | loss scale: 1.0 | grad norm: 0.495 | num zeros: 233671808.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:51.158388 | finish at 2025-09-10 10:35:20 + [2025-09-09 17:10:34] iteration 92/ 11920 | consumed samples: 94208 | elapsed time per iteration (ms): 5329.8 | throughput per GPU (TFLOP/s/GPU): 84.7 | MFU 8.57% | learning rate: 1.543508E-03 | global batch size: 1024 | lm loss: 6.412926E+00 | loss scale: 1.0 | grad norm: 0.522 | num zeros: 221873920.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:41.437108 | finish at 2025-09-10 10:41:16 + [2025-09-09 17:10:40] iteration 93/ 11920 | consumed samples: 95232 | elapsed time per iteration (ms): 5363.2 | throughput per GPU (TFLOP/s/GPU): 84.2 | MFU 8.51% | learning rate: 1.560285E-03 | global batch size: 1024 | lm loss: 6.408752E+00 | loss scale: 1.0 | grad norm: 0.529 | num zeros: 221870144.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:10.131554 | finish at 2025-09-10 10:47:50 + [2025-09-09 17:10:45] iteration 94/ 11920 | consumed samples: 96256 | elapsed time per iteration (ms): 5343.2 | throughput per GPU (TFLOP/s/GPU): 84.5 | MFU 8.54% | learning rate: 1.577062E-03 | global batch size: 1024 | lm loss: 6.365674E+00 | loss scale: 1.0 | grad norm: 0.560 | num zeros: 205351824.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:09.272109 | finish at 2025-09-10 10:43:54 + [2025-09-09 17:10:50] iteration 95/ 11920 | consumed samples: 97280 | elapsed time per iteration (ms): 5358.4 | throughput per GPU (TFLOP/s/GPU): 84.3 | MFU 8.52% | learning rate: 1.593840E-03 | global batch size: 1024 | lm loss: 6.353600E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 188825696.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:02.587881 | finish at 2025-09-10 10:46:53 + [2025-09-09 17:10:56] iteration 96/ 11920 | 
consumed samples: 98304 | elapsed time per iteration (ms): 5379.5 | throughput per GPU (TFLOP/s/GPU): 83.9 | MFU 8.49% | learning rate: 1.610617E-03 | global batch size: 1024 | lm loss: 6.334913E+00 | loss scale: 1.0 | grad norm: 0.504 | num zeros: 165227024.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:07.477592 | finish at 2025-09-10 10:51:03 + [2025-09-09 17:11:01] iteration 97/ 11920 | consumed samples: 99328 | elapsed time per iteration (ms): 5374.7 | throughput per GPU (TFLOP/s/GPU): 84.0 | MFU 8.49% | learning rate: 1.627394E-03 | global batch size: 1024 | lm loss: 6.326926E+00 | loss scale: 1.0 | grad norm: 0.429 | num zeros: 158146848.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:04.557439 | finish at 2025-09-10 10:50:06 + [2025-09-09 17:11:07] iteration 98/ 11920 | consumed samples: 100352 | elapsed time per iteration (ms): 5874.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 1.644171E-03 | global batch size: 1024 | lm loss: 6.311665E+00 | loss scale: 1.0 | grad norm: 0.300 | num zeros: 146340000.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:17:31.434234 | finish at 2025-09-10 12:28:38 + [2025-09-09 17:11:13] iteration 99/ 11920 | consumed samples: 101376 | elapsed time per iteration (ms): 5959.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 1.660949E-03 | global batch size: 1024 | lm loss: 6.298758E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 110939088.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:34:07.971876 | finish at 2025-09-10 12:45:21 + [2025-09-09 17:11:18] iteration 100/ 11920 | consumed samples: 102400 | elapsed time per iteration (ms): 5400.7 | throughput per GPU (TFLOP/s/GPU): 83.6 | MFU 8.45% | learning rate: 1.677726E-03 | global batch size: 1024 | lm loss: 6.279841E+00 | loss scale: 1.0 | grad norm: 0.307 | num 
zeros: 92055040.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:56.821804 | finish at 2025-09-10 10:55:15 + [2025-09-09 17:11:24] iteration 101/ 11920 | consumed samples: 103424 | elapsed time per iteration (ms): 5647.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 1.694503E-03 | global batch size: 1024 | lm loss: 6.302172E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 80255816.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:31.995940 | finish at 2025-09-10 11:43:56 + [2025-09-09 17:11:30] iteration 102/ 11920 | consumed samples: 104448 | elapsed time per iteration (ms): 6218.0 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 1.711280E-03 | global batch size: 1024 | lm loss: 6.296673E+00 | loss scale: 1.0 | grad norm: 0.350 | num zeros: 143982880.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:24:44.466933 | finish at 2025-09-10 13:36:15 + [2025-09-09 17:11:35] iteration 103/ 11920 | consumed samples: 105472 | elapsed time per iteration (ms): 5401.1 | throughput per GPU (TFLOP/s/GPU): 83.6 | MFU 8.45% | learning rate: 1.728058E-03 | global batch size: 1024 | lm loss: 6.262599E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 118023216.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:44.476575 | finish at 2025-09-10 10:55:20 + [2025-09-09 17:11:41] iteration 104/ 11920 | consumed samples: 106496 | elapsed time per iteration (ms): 5703.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 1.744835E-03 | global batch size: 1024 | lm loss: 6.242532E+00 | loss scale: 1.0 | grad norm: 0.356 | num zeros: 108579904.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:43:14.418522 | finish at 2025-09-10 11:54:56 + [2025-09-09 17:11:47] iteration 105/ 11920 | consumed samples: 107520 | elapsed time per 
iteration (ms): 5996.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 1.761612E-03 | global batch size: 1024 | lm loss: 6.233445E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 80253280.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:40:52.273051 | finish at 2025-09-10 12:52:39 + [2025-09-09 17:11:53] iteration 106/ 11920 | consumed samples: 108544 | elapsed time per iteration (ms): 5696.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 1.778389E-03 | global batch size: 1024 | lm loss: 6.195497E+00 | loss scale: 1.0 | grad norm: 0.358 | num zeros: 80262560.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:41:35.302699 | finish at 2025-09-10 11:53:28 + [2025-09-09 17:11:59] iteration 107/ 11920 | consumed samples: 109568 | elapsed time per iteration (ms): 5683.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 1.795167E-03 | global batch size: 1024 | lm loss: 6.240591E+00 | loss scale: 1.0 | grad norm: 0.765 | num zeros: 132182832.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:39:03.881112 | finish at 2025-09-10 11:51:02 + [2025-09-09 17:12:04] iteration 108/ 11920 | consumed samples: 110592 | elapsed time per iteration (ms): 5654.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 1.811944E-03 | global batch size: 1024 | lm loss: 6.235484E+00 | loss scale: 1.0 | grad norm: 0.699 | num zeros: 118018424.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:09.867037 | finish at 2025-09-10 11:45:14 + [2025-09-09 17:12:10] iteration 109/ 11920 | consumed samples: 111616 | elapsed time per iteration (ms): 5730.8 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 1.828721E-03 | global batch size: 1024 | lm loss: 6.185876E+00 | loss scale: 1.0 | grad norm: 0.437 | num zeros: 115659672.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 18:48:06.891724 | finish at 2025-09-10 12:00:17 + [2025-09-09 17:12:15] iteration 110/ 11920 | consumed samples: 112640 | elapsed time per iteration (ms): 5396.4 | throughput per GPU (TFLOP/s/GPU): 83.7 | MFU 8.46% | learning rate: 1.845498E-03 | global batch size: 1024 | lm loss: 6.170319E+00 | loss scale: 1.0 | grad norm: 0.538 | num zeros: 113296240.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:11.511860 | finish at 2025-09-10 10:54:27 + [2025-09-09 17:12:21] iteration 111/ 11920 | consumed samples: 113664 | elapsed time per iteration (ms): 5657.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 1.862276E-03 | global batch size: 1024 | lm loss: 6.180406E+00 | loss scale: 1.0 | grad norm: 0.735 | num zeros: 94414368.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:27.458260 | finish at 2025-09-10 11:45:48 + [2025-09-09 17:12:27] iteration 112/ 11920 | consumed samples: 114688 | elapsed time per iteration (ms): 5674.8 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 1.879053E-03 | global batch size: 1024 | lm loss: 6.159251E+00 | loss scale: 1.0 | grad norm: 0.528 | num zeros: 115656496.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:47.750290 | finish at 2025-09-10 11:49:14 + [2025-09-09 17:12:32] iteration 113/ 11920 | consumed samples: 115712 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 1.895830E-03 | global batch size: 1024 | lm loss: 6.156834E+00 | loss scale: 1.0 | grad norm: 0.368 | num zeros: 110935184.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:23.342417 | finish at 2025-09-10 11:41:56 + [2025-09-09 17:12:38] iteration 114/ 11920 | consumed samples: 116736 | elapsed time per iteration (ms): 6055.5 | throughput per GPU 
(TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 1.912607E-03 | global batch size: 1024 | lm loss: 6.151073E+00 | loss scale: 1.0 | grad norm: 0.608 | num zeros: 89694544.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:51:31.627428 | finish at 2025-09-10 13:04:10 + [2025-09-09 17:12:44] iteration 115/ 11920 | consumed samples: 117760 | elapsed time per iteration (ms): 5400.2 | throughput per GPU (TFLOP/s/GPU): 83.6 | MFU 8.45% | learning rate: 1.929385E-03 | global batch size: 1024 | lm loss: 6.118575E+00 | loss scale: 1.0 | grad norm: 0.355 | num zeros: 101493840.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:29.182388 | finish at 2025-09-10 10:55:13 + [2025-09-09 17:12:49] iteration 116/ 11920 | consumed samples: 118784 | elapsed time per iteration (ms): 5416.9 | throughput per GPU (TFLOP/s/GPU): 83.3 | MFU 8.43% | learning rate: 1.946162E-03 | global batch size: 1024 | lm loss: 6.111765E+00 | loss scale: 1.0 | grad norm: 0.560 | num zeros: 87334192.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:41.027550 | finish at 2025-09-10 10:58:30 + [2025-09-09 17:12:55] iteration 117/ 11920 | consumed samples: 119808 | elapsed time per iteration (ms): 5804.2 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.87% | learning rate: 1.962939E-03 | global batch size: 1024 | lm loss: 6.111306E+00 | loss scale: 1.0 | grad norm: 0.408 | num zeros: 87333408.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:01:47.531818 | finish at 2025-09-10 12:14:43 + [2025-09-09 17:13:01] iteration 118/ 11920 | consumed samples: 120832 | elapsed time per iteration (ms): 5737.4 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 1.979717E-03 | global batch size: 1024 | lm loss: 6.139060E+00 | loss scale: 1.0 | grad norm: 0.560 | num zeros: 87334432.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 18:48:33.079644 | finish at 2025-09-10 12:01:34 + [2025-09-09 17:13:06] iteration 119/ 11920 | consumed samples: 121856 | elapsed time per iteration (ms): 5429.2 | throughput per GPU (TFLOP/s/GPU): 83.2 | MFU 8.41% | learning rate: 1.996494E-03 | global batch size: 1024 | lm loss: 6.101973E+00 | loss scale: 1.0 | grad norm: 0.431 | num zeros: 84974944.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:50.413271 | finish at 2025-09-10 11:00:57 + [2025-09-09 17:13:12] iteration 120/ 11920 | consumed samples: 122880 | elapsed time per iteration (ms): 5414.8 | throughput per GPU (TFLOP/s/GPU): 83.4 | MFU 8.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.065071E+00 | loss scale: 1.0 | grad norm: 0.361 | num zeros: 80253208.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:55.204639 | finish at 2025-09-10 10:58:07 + [2025-09-09 17:13:17] iteration 121/ 11920 | consumed samples: 123904 | elapsed time per iteration (ms): 5425.8 | throughput per GPU (TFLOP/s/GPU): 83.2 | MFU 8.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.083042E+00 | loss scale: 1.0 | grad norm: 0.398 | num zeros: 75531872.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:59.470926 | finish at 2025-09-10 11:00:16 + [2025-09-09 17:13:22] iteration 122/ 11920 | consumed samples: 124928 | elapsed time per iteration (ms): 5433.3 | throughput per GPU (TFLOP/s/GPU): 83.1 | MFU 8.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.046431E+00 | loss scale: 1.0 | grad norm: 0.416 | num zeros: 82612016.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:22.287394 | finish at 2025-09-10 11:01:45 + [2025-09-09 17:13:28] iteration 123/ 11920 | consumed samples: 125952 | elapsed time per iteration (ms): 5429.8 | throughput per GPU (TFLOP/s/GPU): 83.1 | MFU 8.41% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 6.041301E+00 | loss scale: 1.0 | grad norm: 0.368 | num zeros: 87333624.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:35.640696 | finish at 2025-09-10 11:01:03 + [2025-09-09 17:13:33] iteration 124/ 11920 | consumed samples: 126976 | elapsed time per iteration (ms): 5454.0 | throughput per GPU (TFLOP/s/GPU): 82.8 | MFU 8.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.022701E+00 | loss scale: 1.0 | grad norm: 0.443 | num zeros: 115657264.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:15.946432 | finish at 2025-09-10 11:05:49 + [2025-09-09 17:13:39] iteration 125/ 11920 | consumed samples: 128000 | elapsed time per iteration (ms): 5467.1 | throughput per GPU (TFLOP/s/GPU): 82.6 | MFU 8.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.042458E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 110935168.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:44.330894 | finish at 2025-09-10 11:08:23 + [2025-09-09 17:13:44] iteration 126/ 11920 | consumed samples: 129024 | elapsed time per iteration (ms): 5473.1 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.013720E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 94412896.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:49.153086 | finish at 2025-09-10 11:09:33 + [2025-09-09 17:13:50] iteration 127/ 11920 | consumed samples: 130048 | elapsed time per iteration (ms): 5487.8 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.991775E+00 | loss scale: 1.0 | grad norm: 0.375 | num zeros: 82614464.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:37.888316 | finish at 2025-09-10 
11:12:28 + [2025-09-09 17:13:55] iteration 128/ 11920 | consumed samples: 131072 | elapsed time per iteration (ms): 5692.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.010544E+00 | loss scale: 1.0 | grad norm: 0.512 | num zeros: 99138096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:38:44.386948 | finish at 2025-09-10 11:52:40 + [2025-09-09 17:14:01] iteration 129/ 11920 | consumed samples: 132096 | elapsed time per iteration (ms): 5506.0 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.990807E+00 | loss scale: 1.0 | grad norm: 0.497 | num zeros: 84974624.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:00.796704 | finish at 2025-09-10 11:16:02 + [2025-09-09 17:14:07] iteration 130/ 11920 | consumed samples: 133120 | elapsed time per iteration (ms): 5715.3 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.987337E+00 | loss scale: 1.0 | grad norm: 0.468 | num zeros: 92058688.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:43:03.343005 | finish at 2025-09-10 11:57:10 + [2025-09-09 17:14:12] iteration 131/ 11920 | consumed samples: 134144 | elapsed time per iteration (ms): 5491.5 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.992908E+00 | loss scale: 1.0 | grad norm: 0.497 | num zeros: 87332672.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:59.742045 | finish at 2025-09-10 11:13:12 + [2025-09-09 17:14:18] iteration 132/ 11920 | consumed samples: 135168 | elapsed time per iteration (ms): 5497.4 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
5.959867E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 103854096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:03.478207 | finish at 2025-09-10 11:14:21 + [2025-09-09 17:14:23] iteration 133/ 11920 | consumed samples: 136192 | elapsed time per iteration (ms): 5506.8 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.971035E+00 | loss scale: 1.0 | grad norm: 0.498 | num zeros: 101496848.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:48.254605 | finish at 2025-09-10 11:16:11 + [2025-09-09 17:14:29] iteration 134/ 11920 | consumed samples: 137216 | elapsed time per iteration (ms): 5507.9 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.941457E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 89692224.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:55.606405 | finish at 2025-09-10 11:16:24 + [2025-09-09 17:14:34] iteration 135/ 11920 | consumed samples: 138240 | elapsed time per iteration (ms): 5500.3 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.925761E+00 | loss scale: 1.0 | grad norm: 0.332 | num zeros: 92054016.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:20.866096 | finish at 2025-09-10 11:14:55 + [2025-09-09 17:14:40] iteration 136/ 11920 | consumed samples: 139264 | elapsed time per iteration (ms): 5496.1 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.914239E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 80251664.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:26.454798 | finish at 2025-09-10 11:14:06 + [2025-09-09 17:14:45] iteration 137/ 
11920 | consumed samples: 140288 | elapsed time per iteration (ms): 5506.0 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.917576E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 80252432.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:17.178829 | finish at 2025-09-10 11:16:02 + [2025-09-09 17:14:51] iteration 138/ 11920 | consumed samples: 141312 | elapsed time per iteration (ms): 5495.5 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.888530E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 68449400.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:07.855627 | finish at 2025-09-10 11:13:59 + [2025-09-09 17:14:56] iteration 139/ 11920 | consumed samples: 142336 | elapsed time per iteration (ms): 5502.2 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.883332E+00 | loss scale: 1.0 | grad norm: 0.355 | num zeros: 66094396.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:21.742705 | finish at 2025-09-10 11:15:18 + [2025-09-09 17:15:02] iteration 140/ 11920 | consumed samples: 143360 | elapsed time per iteration (ms): 5502.7 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.883758E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 68450136.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:21.579571 | finish at 2025-09-10 11:15:23 + [2025-09-09 17:15:07] iteration 141/ 11920 | consumed samples: 144384 | elapsed time per iteration (ms): 5523.0 | throughput per GPU (TFLOP/s/GPU): 81.7 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.875756E+00 | loss scale: 1.0 | grad norm: 0.220 | 
num zeros: 68450224.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:15.256948 | finish at 2025-09-10 11:19:22 + [2025-09-09 17:15:13] iteration 142/ 11920 | consumed samples: 145408 | elapsed time per iteration (ms): 5520.2 | throughput per GPU (TFLOP/s/GPU): 81.8 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.865431E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 75530288.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:03:37.415607 | finish at 2025-09-10 11:18:50 + [2025-09-09 17:15:18] iteration 143/ 11920 | consumed samples: 146432 | elapsed time per iteration (ms): 5505.6 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.856924E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 66089160.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:39.815934 | finish at 2025-09-10 11:15:58 + [2025-09-09 17:15:24] iteration 144/ 11920 | consumed samples: 147456 | elapsed time per iteration (ms): 5519.3 | throughput per GPU (TFLOP/s/GPU): 81.8 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.840334E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 56648472.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:03:14.729126 | finish at 2025-09-10 11:18:38 + [2025-09-09 17:15:29] iteration 145/ 11920 | consumed samples: 148480 | elapsed time per iteration (ms): 5490.0 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.845733E+00 | loss scale: 1.0 | grad norm: 0.393 | num zeros: 73170024.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:24.354272 | finish at 2025-09-10 11:12:54 + [2025-09-09 17:15:35] iteration 146/ 11920 | consumed samples: 149504 | elapsed time per 
iteration (ms): 5493.5 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.829519E+00 | loss scale: 1.0 | grad norm: 0.439 | num zeros: 51927068.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:00.915268 | finish at 2025-09-10 11:13:36 + [2025-09-09 17:15:40] iteration 147/ 11920 | consumed samples: 150528 | elapsed time per iteration (ms): 5758.5 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.822924E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 51930248.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:49:55.388662 | finish at 2025-09-10 12:05:36 + [2025-09-09 17:15:46] iteration 148/ 11920 | consumed samples: 151552 | elapsed time per iteration (ms): 5462.4 | throughput per GPU (TFLOP/s/GPU): 82.7 | MFU 8.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.818906E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 40127916.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:43.302157 | finish at 2025-09-10 11:07:29 + [2025-09-09 17:15:51] iteration 149/ 11920 | consumed samples: 152576 | elapsed time per iteration (ms): 5469.6 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.801161E+00 | loss scale: 1.0 | grad norm: 0.324 | num zeros: 40125488.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:02.329997 | finish at 2025-09-10 11:08:54 + [2025-09-09 17:15:57] iteration 150/ 11920 | consumed samples: 153600 | elapsed time per iteration (ms): 5454.7 | throughput per GPU (TFLOP/s/GPU): 82.8 | MFU 8.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.804355E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 40127000.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:01.945198 | finish at 2025-09-10 11:05:59 + [2025-09-09 17:16:02] iteration 151/ 11920 | consumed samples: 154624 | elapsed time per iteration (ms): 5477.8 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.776087E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 44846164.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:27.789206 | finish at 2025-09-10 11:10:30 + [2025-09-09 17:16:08] iteration 152/ 11920 | consumed samples: 155648 | elapsed time per iteration (ms): 5449.8 | throughput per GPU (TFLOP/s/GPU): 82.8 | MFU 8.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.787704E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 37765208.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:53.319519 | finish at 2025-09-10 11:05:01 + [2025-09-09 17:16:13] iteration 153/ 11920 | consumed samples: 156672 | elapsed time per iteration (ms): 5462.0 | throughput per GPU (TFLOP/s/GPU): 82.7 | MFU 8.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.771608E+00 | loss scale: 1.0 | grad norm: 0.349 | num zeros: 49566724.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:11.835284 | finish at 2025-09-10 11:07:25 + [2025-09-09 17:16:19] iteration 154/ 11920 | consumed samples: 157696 | elapsed time per iteration (ms): 5477.1 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.756892E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 44846100.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:04.138053 | finish at 2025-09-10 11:10:23 + [2025-09-09 17:16:24] iteration 155/ 11920 | consumed samples: 158720 | elapsed time per iteration (ms): 5449.9 | throughput per GPU 
(TFLOP/s/GPU): 82.8 | MFU 8.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.760955E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 51927080.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:38.608217 | finish at 2025-09-10 11:05:03 + [2025-09-09 17:16:30] iteration 156/ 11920 | consumed samples: 159744 | elapsed time per iteration (ms): 5451.9 | throughput per GPU (TFLOP/s/GPU): 82.8 | MFU 8.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.744971E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 44846164.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:56.359215 | finish at 2025-09-10 11:05:26 + [2025-09-09 17:16:35] iteration 157/ 11920 | consumed samples: 160768 | elapsed time per iteration (ms): 5464.3 | throughput per GPU (TFLOP/s/GPU): 82.6 | MFU 8.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.737560E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 42488872.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:16.576753 | finish at 2025-09-10 11:07:52 + [2025-09-09 17:16:41] iteration 158/ 11920 | consumed samples: 161792 | elapsed time per iteration (ms): 5445.5 | throughput per GPU (TFLOP/s/GPU): 82.9 | MFU 8.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.721798E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 47207976.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:29.521104 | finish at 2025-09-10 11:04:10 + [2025-09-09 17:16:46] iteration 159/ 11920 | consumed samples: 162816 | elapsed time per iteration (ms): 5449.0 | throughput per GPU (TFLOP/s/GPU): 82.9 | MFU 8.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.739238E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 47207952.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:48:05.241767 | finish at 2025-09-10 11:04:51 + [2025-09-09 17:16:51] iteration 160/ 11920 | consumed samples: 163840 | elapsed time per iteration (ms): 5453.6 | throughput per GPU (TFLOP/s/GPU): 82.8 | MFU 8.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.714969E+00 | loss scale: 1.0 | grad norm: 0.468 | num zeros: 44849176.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:54.741726 | finish at 2025-09-10 11:05:46 + [2025-09-09 17:16:57] iteration 161/ 11920 | consumed samples: 164864 | elapsed time per iteration (ms): 5463.6 | throughput per GPU (TFLOP/s/GPU): 82.6 | MFU 8.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.720201E+00 | loss scale: 1.0 | grad norm: 0.448 | num zeros: 35407116.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:46.816301 | finish at 2025-09-10 11:07:44 + [2025-09-09 17:17:02] iteration 162/ 11920 | consumed samples: 165888 | elapsed time per iteration (ms): 5470.2 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.715709E+00 | loss scale: 1.0 | grad norm: 0.633 | num zeros: 44847696.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:58.519817 | finish at 2025-09-10 11:09:01 + [2025-09-09 17:17:08] iteration 163/ 11920 | consumed samples: 166912 | elapsed time per iteration (ms): 5553.5 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.726972E+00 | loss scale: 1.0 | grad norm: 0.393 | num zeros: 33044596.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:12.296938 | finish at 2025-09-10 11:25:20 + [2025-09-09 17:17:13] iteration 164/ 11920 | consumed samples: 167936 | elapsed time per iteration (ms): 5467.4 | throughput per GPU (TFLOP/s/GPU): 82.6 | MFU 8.35% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 5.727459E+00 | loss scale: 1.0 | grad norm: 0.434 | num zeros: 42485820.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:14.976695 | finish at 2025-09-10 11:08:28 + [2025-09-09 17:17:19] iteration 165/ 11920 | consumed samples: 168960 | elapsed time per iteration (ms): 5489.0 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.691818E+00 | loss scale: 1.0 | grad norm: 0.472 | num zeros: 40125472.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:23.436989 | finish at 2025-09-10 11:12:42 + [2025-09-09 17:17:24] iteration 166/ 11920 | consumed samples: 169984 | elapsed time per iteration (ms): 5496.0 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.687760E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 23604778.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:39.496993 | finish at 2025-09-10 11:14:04 + [2025-09-09 17:17:30] iteration 167/ 11920 | consumed samples: 171008 | elapsed time per iteration (ms): 5473.5 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.686690E+00 | loss scale: 1.0 | grad norm: 0.360 | num zeros: 33044544.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:09.975604 | finish at 2025-09-10 11:09:40 + [2025-09-09 17:17:36] iteration 168/ 11920 | consumed samples: 172032 | elapsed time per iteration (ms): 5719.0 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.673030E+00 | loss scale: 1.0 | grad norm: 0.399 | num zeros: 28323844.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:40:09.778849 | finish at 2025-09-10 
11:57:45 + [2025-09-09 17:17:42] iteration 169/ 11920 | consumed samples: 173056 | elapsed time per iteration (ms): 6021.9 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.696536E+00 | loss scale: 1.0 | grad norm: 0.701 | num zeros: 49566728.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:39:23.542294 | finish at 2025-09-10 12:57:05 + [2025-09-09 17:17:47] iteration 170/ 11920 | consumed samples: 174080 | elapsed time per iteration (ms): 5686.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.672734E+00 | loss scale: 1.0 | grad norm: 0.300 | num zeros: 33044528.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:39.065213 | finish at 2025-09-10 11:51:26 + [2025-09-09 17:17:53] iteration 171/ 11920 | consumed samples: 175104 | elapsed time per iteration (ms): 5523.9 | throughput per GPU (TFLOP/s/GPU): 81.7 | MFU 8.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.658221E+00 | loss scale: 1.0 | grad norm: 0.470 | num zeros: 28323944.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:40.082984 | finish at 2025-09-10 11:19:33 + [2025-09-09 17:17:58] iteration 172/ 11920 | consumed samples: 176128 | elapsed time per iteration (ms): 5462.4 | throughput per GPU (TFLOP/s/GPU): 82.7 | MFU 8.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.689199E+00 | loss scale: 1.0 | grad norm: 0.624 | num zeros: 30684172.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:32.748084 | finish at 2025-09-10 11:07:31 + [2025-09-09 17:18:04] iteration 173/ 11920 | consumed samples: 177152 | elapsed time per iteration (ms): 5481.5 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
5.659139E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 25963544.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:10.638911 | finish at 2025-09-10 11:11:14 + [2025-09-09 17:18:09] iteration 174/ 11920 | consumed samples: 178176 | elapsed time per iteration (ms): 5497.6 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.636106E+00 | loss scale: 1.0 | grad norm: 0.401 | num zeros: 23603264.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:14.911340 | finish at 2025-09-10 11:14:24 + [2025-09-09 17:18:15] iteration 175/ 11920 | consumed samples: 179200 | elapsed time per iteration (ms): 5477.2 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.626627E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 21245978.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:09.227128 | finish at 2025-09-10 11:10:24 + [2025-09-09 17:18:20] iteration 176/ 11920 | consumed samples: 180224 | elapsed time per iteration (ms): 5469.9 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.639540E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 21244428.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:38.694740 | finish at 2025-09-10 11:08:59 + [2025-09-09 17:18:26] iteration 177/ 11920 | consumed samples: 181248 | elapsed time per iteration (ms): 5477.0 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.626214E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 23603234.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:56.836540 | finish at 2025-09-10 11:10:23 + [2025-09-09 17:18:31] iteration 178/ 
11920 | consumed samples: 182272 | elapsed time per iteration (ms): 5494.4 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.617796E+00 | loss scale: 1.0 | grad norm: 0.379 | num zeros: 21244460.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:15.088314 | finish at 2025-09-10 11:13:46 + [2025-09-09 17:18:37] iteration 179/ 11920 | consumed samples: 183296 | elapsed time per iteration (ms): 5458.6 | throughput per GPU (TFLOP/s/GPU): 82.7 | MFU 8.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.632932E+00 | loss scale: 1.0 | grad norm: 0.385 | num zeros: 21242966.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:08.927648 | finish at 2025-09-10 11:06:46 + [2025-09-09 17:18:42] iteration 180/ 11920 | consumed samples: 184320 | elapsed time per iteration (ms): 5485.4 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.606768E+00 | loss scale: 1.0 | grad norm: 0.322 | num zeros: 21245328.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:18.340836 | finish at 2025-09-10 11:12:00 + [2025-09-09 17:18:48] iteration 181/ 11920 | consumed samples: 185344 | elapsed time per iteration (ms): 5484.0 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.586701E+00 | loss scale: 1.0 | grad norm: 0.349 | num zeros: 23603272.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:56.935907 | finish at 2025-09-10 11:11:45 + [2025-09-09 17:18:53] iteration 182/ 11920 | consumed samples: 186368 | elapsed time per iteration (ms): 5480.1 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.570782E+00 | loss scale: 1.0 | grad norm: 0.262 | 
num zeros: 30684194.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:05.267294 | finish at 2025-09-10 11:10:58 + [2025-09-09 17:18:59] iteration 183/ 11920 | consumed samples: 187392 | elapsed time per iteration (ms): 5485.8 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.570769E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 21243652.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:06.952456 | finish at 2025-09-10 11:12:06 + [2025-09-09 17:19:04] iteration 184/ 11920 | consumed samples: 188416 | elapsed time per iteration (ms): 5494.6 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.566920E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 28323880.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:45.157911 | finish at 2025-09-10 11:13:49 + [2025-09-09 17:19:10] iteration 185/ 11920 | consumed samples: 189440 | elapsed time per iteration (ms): 5495.5 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.551391E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 14163487.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:49.953729 | finish at 2025-09-10 11:14:00 + [2025-09-09 17:19:15] iteration 186/ 11920 | consumed samples: 190464 | elapsed time per iteration (ms): 5815.6 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.545220E+00 | loss scale: 1.0 | grad norm: 0.336 | num zeros: 23604746.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:57:20.389322 | finish at 2025-09-10 12:16:36 + [2025-09-09 17:19:21] iteration 187/ 11920 | consumed samples: 191488 | elapsed time per 
iteration (ms): 5493.9 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.542905E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 18882608.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:19.520996 | finish at 2025-09-10 11:13:40 + [2025-09-09 17:19:26] iteration 188/ 11920 | consumed samples: 192512 | elapsed time per iteration (ms): 5516.2 | throughput per GPU (TFLOP/s/GPU): 81.8 | MFU 8.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.534461E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 18882652.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:36.585030 | finish at 2025-09-10 11:18:03 + [2025-09-09 17:19:32] iteration 189/ 11920 | consumed samples: 193536 | elapsed time per iteration (ms): 5493.8 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.524486E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 18882656.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:08.007451 | finish at 2025-09-10 11:13:40 + [2025-09-09 17:19:37] iteration 190/ 11920 | consumed samples: 194560 | elapsed time per iteration (ms): 5509.5 | throughput per GPU (TFLOP/s/GPU): 81.9 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.521518E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 18882566.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:05.920730 | finish at 2025-09-10 11:16:43 + [2025-09-09 17:19:43] iteration 191/ 11920 | consumed samples: 195584 | elapsed time per iteration (ms): 5532.8 | throughput per GPU (TFLOP/s/GPU): 81.6 | MFU 8.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.502988E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 14161923.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:33.657032 | finish at 2025-09-10 11:21:17 + [2025-09-09 17:19:48] iteration 192/ 11920 | consumed samples: 196608 | elapsed time per iteration (ms): 5518.5 | throughput per GPU (TFLOP/s/GPU): 81.8 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.534382E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 21242916.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:40.471333 | finish at 2025-09-10 11:18:29 + [2025-09-09 17:19:54] iteration 193/ 11920 | consumed samples: 197632 | elapsed time per iteration (ms): 5547.4 | throughput per GPU (TFLOP/s/GPU): 81.4 | MFU 8.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.497231E+00 | loss scale: 1.0 | grad norm: 0.397 | num zeros: 11802400.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:14.756796 | finish at 2025-09-10 11:24:09 + [2025-09-09 17:19:59] iteration 194/ 11920 | consumed samples: 198656 | elapsed time per iteration (ms): 5517.1 | throughput per GPU (TFLOP/s/GPU): 81.8 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.519269E+00 | loss scale: 1.0 | grad norm: 0.465 | num zeros: 16525496.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:13.482174 | finish at 2025-09-10 11:18:13 + [2025-09-09 17:20:05] iteration 195/ 11920 | consumed samples: 199680 | elapsed time per iteration (ms): 5522.2 | throughput per GPU (TFLOP/s/GPU): 81.8 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.528417E+00 | loss scale: 1.0 | grad norm: 0.426 | num zeros: 14161963.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:08.170853 | finish at 2025-09-10 11:19:13 + [2025-09-09 17:20:11] iteration 196/ 11920 | consumed samples: 200704 | elapsed time per iteration (ms): 5495.8 | throughput per GPU 
(TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.523459E+00 | loss scale: 1.0 | grad norm: 0.444 | num zeros: 14161926.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:52.667172 | finish at 2025-09-10 11:14:03 + [2025-09-09 17:20:16] iteration 197/ 11920 | consumed samples: 201728 | elapsed time per iteration (ms): 5474.6 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.519552E+00 | loss scale: 1.0 | grad norm: 0.556 | num zeros: 21242886.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:38.689184 | finish at 2025-09-10 11:09:55 + [2025-09-09 17:20:21] iteration 198/ 11920 | consumed samples: 202752 | elapsed time per iteration (ms): 5496.6 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.524850E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 11801608.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:51.683561 | finish at 2025-09-10 11:14:13 + [2025-09-09 17:20:27] iteration 199/ 11920 | consumed samples: 203776 | elapsed time per iteration (ms): 5499.8 | throughput per GPU (TFLOP/s/GPU): 82.1 | MFU 8.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.514679E+00 | loss scale: 1.0 | grad norm: 0.394 | num zeros: 4726025.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:23.370588 | finish at 2025-09-10 11:14:50 + [2025-09-09 17:20:32] iteration 200/ 11920 | consumed samples: 204800 | elapsed time per iteration (ms): 5476.7 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.492636E+00 | loss scale: 1.0 | grad norm: 0.388 | num zeros: 9442862.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 17:49:46.952734 | finish at 2025-09-10 11:10:19 + [2025-09-09 17:20:38] iteration 201/ 11920 | consumed samples: 205824 | elapsed time per iteration (ms): 5486.4 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.470498E+00 | loss scale: 1.0 | grad norm: 0.350 | num zeros: 9441319.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:35.561755 | finish at 2025-09-10 11:12:14 + [2025-09-09 17:20:43] iteration 202/ 11920 | consumed samples: 206848 | elapsed time per iteration (ms): 5477.1 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.461176E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 9441315.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:41.072850 | finish at 2025-09-10 11:10:25 + [2025-09-09 17:20:49] iteration 203/ 11920 | consumed samples: 207872 | elapsed time per iteration (ms): 5478.8 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.455997E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 4721414.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:54.818135 | finish at 2025-09-10 11:10:44 + [2025-09-09 17:20:54] iteration 204/ 11920 | consumed samples: 208896 | elapsed time per iteration (ms): 5481.2 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.438308E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4720670.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:17.683097 | finish at 2025-09-10 11:11:12 + [2025-09-09 17:21:00] iteration 205/ 11920 | consumed samples: 209920 | elapsed time per iteration (ms): 5482.4 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 
| global batch size: 1024 | lm loss: 5.443638E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 4720647.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:26.449370 | finish at 2025-09-10 11:11:26 + [2025-09-09 17:21:05] iteration 206/ 11920 | consumed samples: 210944 | elapsed time per iteration (ms): 5479.5 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.416642E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2361870.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:47.249059 | finish at 2025-09-10 11:10:53 + [2025-09-09 17:21:11] iteration 207/ 11920 | consumed samples: 211968 | elapsed time per iteration (ms): 5480.6 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.405033E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 7082505.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:54.263604 | finish at 2025-09-10 11:11:05 + [2025-09-09 17:21:16] iteration 208/ 11920 | consumed samples: 212992 | elapsed time per iteration (ms): 5478.9 | throughput per GPU (TFLOP/s/GPU): 82.4 | MFU 8.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.403975E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 7080969.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:29.068954 | finish at 2025-09-10 11:10:45 + [2025-09-09 17:21:22] iteration 209/ 11920 | consumed samples: 214016 | elapsed time per iteration (ms): 5487.5 | throughput per GPU (TFLOP/s/GPU): 82.3 | MFU 8.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.415440E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7080969.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:03.600984 | finish at 2025-09-10 11:12:25 + 
[2025-09-09 17:21:27] iteration 210/ 11920 | consumed samples: 215040 | elapsed time per iteration (ms): 5473.9 | throughput per GPU (TFLOP/s/GPU): 82.5 | MFU 8.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.392186E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 9441311.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:19.450898 | finish at 2025-09-10 11:09:47 + [2025-09-09 17:21:33] iteration 211/ 11920 | consumed samples: 216064 | elapsed time per iteration (ms): 5506.6 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.402669E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 7081008.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:36.848057 | finish at 2025-09-10 11:16:10 + [2025-09-09 17:21:38] iteration 212/ 11920 | consumed samples: 217088 | elapsed time per iteration (ms): 5491.3 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.431968E+00 | loss scale: 1.0 | grad norm: 0.593 | num zeros: 11801652.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:31.748054 | finish at 2025-09-10 11:13:10 + [2025-09-09 17:21:44] iteration 213/ 11920 | consumed samples: 218112 | elapsed time per iteration (ms): 5523.3 | throughput per GPU (TFLOP/s/GPU): 81.7 | MFU 8.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.407377E+00 | loss scale: 1.0 | grad norm: 0.468 | num zeros: 4722188.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:40.968072 | finish at 2025-09-10 11:19:25 + [2025-09-09 17:21:49] iteration 214/ 11920 | consumed samples: 219136 | elapsed time per iteration (ms): 5507.0 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.409814E+00 | 
loss scale: 1.0 | grad norm: 0.301 | num zeros: 14163516.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:25.449592 | finish at 2025-09-10 11:16:15 + [2025-09-09 17:21:55] iteration 215/ 11920 | consumed samples: 220160 | elapsed time per iteration (ms): 5560.9 | throughput per GPU (TFLOP/s/GPU): 81.2 | MFU 8.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.363749E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 11801608.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:50.406741 | finish at 2025-09-10 11:26:45 + [2025-09-09 17:22:00] iteration 216/ 11920 | consumed samples: 221184 | elapsed time per iteration (ms): 5561.3 | throughput per GPU (TFLOP/s/GPU): 81.2 | MFU 8.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.373253E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 4720650.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:49.743076 | finish at 2025-09-10 11:26:50 + [2025-09-09 17:22:06] iteration 217/ 11920 | consumed samples: 222208 | elapsed time per iteration (ms): 5550.0 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.375427E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 9442825.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:32.067974 | finish at 2025-09-10 11:24:38 + [2025-09-09 17:22:12] iteration 218/ 11920 | consumed samples: 223232 | elapsed time per iteration (ms): 5565.1 | throughput per GPU (TFLOP/s/GPU): 81.1 | MFU 8.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.356729E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7082522.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:23.282334 | finish at 2025-09-10 11:27:35 + [2025-09-09 17:22:17] iteration 219/ 11920 | consumed 
samples: 224256 | elapsed time per iteration (ms): 5560.5 | throughput per GPU (TFLOP/s/GPU): 81.2 | MFU 8.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.360250E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4720643.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:23.484729 | finish at 2025-09-10 11:26:41 + [2025-09-09 17:22:23] iteration 220/ 11920 | consumed samples: 225280 | elapsed time per iteration (ms): 5569.2 | throughput per GPU (TFLOP/s/GPU): 81.1 | MFU 8.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.373465E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 4720655.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:59.548402 | finish at 2025-09-10 11:28:22 + [2025-09-09 17:22:28] iteration 221/ 11920 | consumed samples: 226304 | elapsed time per iteration (ms): 5554.8 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.376877E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 4720642.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:03:05.346192 | finish at 2025-09-10 11:25:34 + [2025-09-09 17:22:34] iteration 222/ 11920 | consumed samples: 227328 | elapsed time per iteration (ms): 5573.5 | throughput per GPU (TFLOP/s/GPU): 81.0 | MFU 8.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.388557E+00 | loss scale: 1.0 | grad norm: 0.556 | num zeros: 4720708.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:06:38.648644 | finish at 2025-09-10 11:29:12 + [2025-09-09 17:22:39] iteration 223/ 11920 | consumed samples: 228352 | elapsed time per iteration (ms): 5541.4 | throughput per GPU (TFLOP/s/GPU): 81.5 | MFU 8.24% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.358445E+00 | loss scale: 1.0 | grad norm: 0.329 | num zeros: 11801623.0 
| number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:18.243319 | finish at 2025-09-10 11:22:58 + [2025-09-09 17:22:45] iteration 224/ 11920 | consumed samples: 229376 | elapsed time per iteration (ms): 5552.2 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.350267E+00 | loss scale: 1.0 | grad norm: 0.312 | num zeros: 9441325.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:18.955982 | finish at 2025-09-10 11:25:04 + [2025-09-09 17:22:50] iteration 225/ 11920 | consumed samples: 230400 | elapsed time per iteration (ms): 5569.4 | throughput per GPU (TFLOP/s/GPU): 81.1 | MFU 8.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.338989E+00 | loss scale: 1.0 | grad norm: 0.379 | num zeros: 7082500.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:34.309506 | finish at 2025-09-10 11:28:25 + [2025-09-09 17:22:56] iteration 226/ 11920 | consumed samples: 231424 | elapsed time per iteration (ms): 5533.0 | throughput per GPU (TFLOP/s/GPU): 81.6 | MFU 8.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.333497E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 7081004.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:23.150049 | finish at 2025-09-10 11:21:19 + [2025-09-09 17:23:02] iteration 227/ 11920 | consumed samples: 232448 | elapsed time per iteration (ms): 5579.8 | throughput per GPU (TFLOP/s/GPU): 80.9 | MFU 8.18% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.326734E+00 | loss scale: 1.0 | grad norm: 0.336 | num zeros: 7081006.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:24.023039 | finish at 2025-09-10 11:30:26 + [2025-09-09 17:23:07] iteration 228/ 11920 | consumed samples: 233472 | elapsed time per iteration (ms): 5582.4 | 
throughput per GPU (TFLOP/s/GPU): 80.9 | MFU 8.18% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.323369E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 9441308.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:49.611333 | finish at 2025-09-10 11:30:57 + [2025-09-09 17:23:13] iteration 229/ 11920 | consumed samples: 234496 | elapsed time per iteration (ms): 5575.5 | throughput per GPU (TFLOP/s/GPU): 81.0 | MFU 8.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.306180E+00 | loss scale: 1.0 | grad norm: 0.314 | num zeros: 7080966.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:06:22.889111 | finish at 2025-09-10 11:29:36 + [2025-09-09 17:23:18] iteration 230/ 11920 | consumed samples: 235520 | elapsed time per iteration (ms): 5570.1 | throughput per GPU (TFLOP/s/GPU): 81.1 | MFU 8.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.312059E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 7080960.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:14.152076 | finish at 2025-09-10 11:28:32 + [2025-09-09 17:23:24] iteration 231/ 11920 | consumed samples: 236544 | elapsed time per iteration (ms): 5604.6 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.278049E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7080961.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:11:52.107536 | finish at 2025-09-10 11:35:16 + [2025-09-09 17:23:30] iteration 232/ 11920 | consumed samples: 237568 | elapsed time per iteration (ms): 5978.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.275296E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 7082499.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 19:24:30.294657 | finish at 2025-09-10 12:48:00 + [2025-09-09 17:23:36] iteration 233/ 11920 | consumed samples: 238592 | elapsed time per iteration (ms): 5824.8 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.281241E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4722177.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:54:34.613312 | finish at 2025-09-10 12:18:10 + [2025-09-09 17:23:41] iteration 234/ 11920 | consumed samples: 239616 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.270522E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 7080963.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:35.034474 | finish at 2025-09-10 11:38:16 + [2025-09-09 17:23:47] iteration 235/ 11920 | consumed samples: 240640 | elapsed time per iteration (ms): 5589.9 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.268099E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 7080971.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:37.639028 | finish at 2025-09-10 11:32:25 + [2025-09-09 17:23:53] iteration 236/ 11920 | consumed samples: 241664 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.269321E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 4720641.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:50.116111 | finish at 2025-09-10 11:41:43 + [2025-09-09 17:23:58] iteration 237/ 11920 | consumed samples: 242688 | elapsed time per iteration (ms): 5824.3 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | 
learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.284074E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 2361120.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:54:05.219500 | finish at 2025-09-10 12:18:04 + [2025-09-09 17:24:04] iteration 238/ 11920 | consumed samples: 243712 | elapsed time per iteration (ms): 5950.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.298491E+00 | loss scale: 1.0 | grad norm: 0.457 | num zeros: 2360322.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:18:37.757401 | finish at 2025-09-10 12:42:42 + [2025-09-09 17:24:10] iteration 239/ 11920 | consumed samples: 244736 | elapsed time per iteration (ms): 5905.6 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.297354E+00 | loss scale: 1.0 | grad norm: 0.439 | num zeros: 2360323.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:09:43.773948 | finish at 2025-09-10 12:33:54 + [2025-09-09 17:24:16] iteration 240/ 11920 | consumed samples: 245760 | elapsed time per iteration (ms): 5598.3 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.270654E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 2360323.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:48.358192 | finish at 2025-09-10 11:34:04 + [2025-09-09 17:24:22] iteration 241/ 11920 | consumed samples: 246784 | elapsed time per iteration (ms): 6096.3 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.279941E+00 | loss scale: 1.0 | grad norm: 0.440 | num zeros: 69.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:46:38.597273 | finish at 
2025-09-10 13:11:01 + [2025-09-09 17:24:28] iteration 242/ 11920 | consumed samples: 247808 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.264757E+00 | loss scale: 1.0 | grad norm: 0.365 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:20.241408 | finish at 2025-09-10 11:37:48 + [2025-09-09 17:24:33] iteration 243/ 11920 | consumed samples: 248832 | elapsed time per iteration (ms): 5839.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.282854E+00 | loss scale: 1.0 | grad norm: 0.482 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:56:27.779673 | finish at 2025-09-10 12:21:01 + [2025-09-09 17:24:39] iteration 244/ 11920 | consumed samples: 249856 | elapsed time per iteration (ms): 5598.8 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.285427E+00 | loss scale: 1.0 | grad norm: 0.506 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:31.802496 | finish at 2025-09-10 11:34:11 + [2025-09-09 17:24:45] iteration 245/ 11920 | consumed samples: 250880 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.270439E+00 | loss scale: 1.0 | grad norm: 0.355 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:13.198329 | finish at 2025-09-10 11:37:58 + [2025-09-09 17:24:50] iteration 246/ 11920 | consumed samples: 251904 | elapsed time per iteration (ms): 5606.1 | throughput per GPU (TFLOP/s/GPU): 80.5 | MFU 8.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
5.238293E+00 | loss scale: 1.0 | grad norm: 0.355 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:45.901825 | finish at 2025-09-10 11:35:36 + [2025-09-09 17:24:56] iteration 247/ 11920 | consumed samples: 252928 | elapsed time per iteration (ms): 5877.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.236501E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:03:26.981124 | finish at 2025-09-10 12:28:23 + [2025-09-09 17:25:02] iteration 248/ 11920 | consumed samples: 253952 | elapsed time per iteration (ms): 5900.4 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.226505E+00 | loss scale: 1.0 | grad norm: 0.327 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:07:49.710011 | finish at 2025-09-10 12:32:52 + [2025-09-09 17:25:08] iteration 249/ 11920 | consumed samples: 254976 | elapsed time per iteration (ms): 6132.5 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.224728E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:52:52.228636 | finish at 2025-09-10 13:18:00 + [2025-09-09 17:25:14] iteration 250/ 11920 | consumed samples: 256000 | elapsed time per iteration (ms): 5844.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.215283E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:56:46.398089 | finish at 2025-09-10 12:22:00 + [2025-09-09 17:25:20] iteration 251/ 11920 | consumed samples: 257024 | 
elapsed time per iteration (ms): 5938.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.186505E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:14:56.867562 | finish at 2025-09-10 12:40:17 + [2025-09-09 17:25:26] iteration 252/ 11920 | consumed samples: 258048 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.190397E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:09.931856 | finish at 2025-09-10 11:38:35 + [2025-09-09 17:25:31] iteration 253/ 11920 | consumed samples: 259072 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.187006E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2360324.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:19.189441 | finish at 2025-09-10 11:38:50 + [2025-09-09 17:25:37] iteration 254/ 11920 | consumed samples: 260096 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.172147E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:15:18.793387 | finish at 2025-09-10 11:40:56 + [2025-09-09 17:25:42] iteration 255/ 11920 | consumed samples: 261120 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.166578E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2360320.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:30.335248 | finish at 2025-09-10 11:39:13 + [2025-09-09 17:25:48] iteration 256/ 11920 | consumed samples: 262144 | elapsed time per iteration (ms): 5644.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.163802E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:16.494347 | finish at 2025-09-10 11:43:05 + [2025-09-09 17:25:54] iteration 257/ 11920 | consumed samples: 263168 | elapsed time per iteration (ms): 5831.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.159737E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:53:33.902955 | finish at 2025-09-10 12:19:28 + [2025-09-09 17:26:00] iteration 258/ 11920 | consumed samples: 264192 | elapsed time per iteration (ms): 5982.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.165822E+00 | loss scale: 1.0 | grad norm: 0.332 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:22:52.659316 | finish at 2025-09-10 12:48:53 + [2025-09-09 17:26:05] iteration 259/ 11920 | consumed samples: 265216 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.197745E+00 | loss scale: 1.0 | grad norm: 0.438 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:42.631253 | finish at 2025-09-10 11:39:48 + [2025-09-09 17:26:12] iteration 260/ 11920 | consumed samples: 266240 | elapsed time per iteration (ms): 6279.2 | throughput per GPU (TFLOP/s/GPU): 71.9 | 
MFU 7.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.188039E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:20:15.005865 | finish at 2025-09-10 13:46:27 + [2025-09-09 17:26:18] iteration 261/ 11920 | consumed samples: 267264 | elapsed time per iteration (ms): 6101.7 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.197333E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:45:40.293710 | finish at 2025-09-10 13:11:58 + [2025-09-09 17:26:24] iteration 262/ 11920 | consumed samples: 268288 | elapsed time per iteration (ms): 5652.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.195051E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:12.849881 | finish at 2025-09-10 11:44:36 + [2025-09-09 17:26:29] iteration 263/ 11920 | consumed samples: 269312 | elapsed time per iteration (ms): 5642.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.180319E+00 | loss scale: 1.0 | grad norm: 0.388 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:13.879552 | finish at 2025-09-10 11:42:43 + [2025-09-09 17:26:36] iteration 264/ 11920 | consumed samples: 270336 | elapsed time per iteration (ms): 6507.0 | throughput per GPU (TFLOP/s/GPU): 69.4 | MFU 7.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.180731E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 21:04:05.511053 | finish at 2025-09-10 14:30:41 + 
[2025-09-09 17:26:41] iteration 265/ 11920 | consumed samples: 271360 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.177909E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:49.983509 | finish at 2025-09-10 11:40:31 + [2025-09-09 17:26:47] iteration 266/ 11920 | consumed samples: 272384 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.159184E+00 | loss scale: 1.0 | grad norm: 0.336 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:37.847491 | finish at 2025-09-10 11:41:25 + [2025-09-09 17:26:53] iteration 267/ 11920 | consumed samples: 273408 | elapsed time per iteration (ms): 5854.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.149394E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:56:57.135376 | finish at 2025-09-10 12:23:50 + [2025-09-09 17:26:58] iteration 268/ 11920 | consumed samples: 274432 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.128089E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:41.488784 | finish at 2025-09-10 11:41:40 + [2025-09-09 17:27:04] iteration 269/ 11920 | consumed samples: 275456 | elapsed time per iteration (ms): 5650.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.113611E+00 | loss scale: 1.0 | grad norm: 
0.288 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:11.781715 | finish at 2025-09-10 11:44:16 + [2025-09-09 17:27:10] iteration 270/ 11920 | consumed samples: 276480 | elapsed time per iteration (ms): 5646.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.109164E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:17.184951 | finish at 2025-09-10 11:43:27 + [2025-09-09 17:27:16] iteration 271/ 11920 | consumed samples: 277504 | elapsed time per iteration (ms): 5877.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.111825E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:01:03.601482 | finish at 2025-09-10 12:28:19 + [2025-09-09 17:27:21] iteration 272/ 11920 | consumed samples: 278528 | elapsed time per iteration (ms): 5644.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.102077E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:15:42.018005 | finish at 2025-09-10 11:43:03 + [2025-09-09 17:27:27] iteration 273/ 11920 | consumed samples: 279552 | elapsed time per iteration (ms): 5655.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.078741E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:51.512671 | finish at 2025-09-10 11:45:18 + [2025-09-09 17:27:33] iteration 274/ 11920 | consumed samples: 280576 | elapsed time per iteration (ms): 5649.5 | 
throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.092850E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:33.845296 | finish at 2025-09-10 11:44:06 + [2025-09-09 17:27:38] iteration 275/ 11920 | consumed samples: 281600 | elapsed time per iteration (ms): 5659.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.080858E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:21.430652 | finish at 2025-09-10 11:46:00 + [2025-09-09 17:27:44] iteration 276/ 11920 | consumed samples: 282624 | elapsed time per iteration (ms): 5658.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.107334E+00 | loss scale: 1.0 | grad norm: 0.361 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:10.005393 | finish at 2025-09-10 11:45:54 + [2025-09-09 17:27:50] iteration 277/ 11920 | consumed samples: 283648 | elapsed time per iteration (ms): 5684.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.065554E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:22:59.558897 | finish at 2025-09-10 11:50:49 + [2025-09-09 17:27:55] iteration 278/ 11920 | consumed samples: 284672 | elapsed time per iteration (ms): 5700.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.073405E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
18:26:04.047025 | finish at 2025-09-10 11:53:59 + [2025-09-09 17:28:01] iteration 279/ 11920 | consumed samples: 285696 | elapsed time per iteration (ms): 5709.2 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.077533E+00 | loss scale: 1.0 | grad norm: 0.362 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:40.790548 | finish at 2025-09-10 11:55:42 + [2025-09-09 17:28:07] iteration 280/ 11920 | consumed samples: 286720 | elapsed time per iteration (ms): 5915.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.068007E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:07:33.865957 | finish at 2025-09-10 12:35:41 + [2025-09-09 17:28:13] iteration 281/ 11920 | consumed samples: 287744 | elapsed time per iteration (ms): 6228.2 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.058188E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:08:10.441113 | finish at 2025-09-10 13:36:24 + [2025-09-09 17:28:19] iteration 282/ 11920 | consumed samples: 288768 | elapsed time per iteration (ms): 5930.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.045082E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:10:15.324698 | finish at 2025-09-10 12:38:34 + [2025-09-09 17:28:25] iteration 283/ 11920 | consumed samples: 289792 | elapsed time per iteration (ms): 5732.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 5.028359E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:53.198516 | finish at 2025-09-10 12:00:18 + [2025-09-09 17:28:31] iteration 284/ 11920 | consumed samples: 290816 | elapsed time per iteration (ms): 5745.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.053535E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:08.757635 | finish at 2025-09-10 12:02:39 + [2025-09-09 17:28:36] iteration 285/ 11920 | consumed samples: 291840 | elapsed time per iteration (ms): 5721.7 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.026081E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:32.120428 | finish at 2025-09-10 11:58:08 + [2025-09-09 17:28:42] iteration 286/ 11920 | consumed samples: 292864 | elapsed time per iteration (ms): 5752.7 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.039400E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:35:26.843508 | finish at 2025-09-10 12:04:09 + [2025-09-09 17:28:48] iteration 287/ 11920 | consumed samples: 293888 | elapsed time per iteration (ms): 5721.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.051496E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:16.372496 | finish at 2025-09-10 11:58:04 + [2025-09-09 17:28:53] iteration 288/ 11920 | consumed samples: 294912 
| elapsed time per iteration (ms): 5723.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.029159E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:40.380768 | finish at 2025-09-10 11:58:34 + [2025-09-09 17:28:59] iteration 289/ 11920 | consumed samples: 295936 | elapsed time per iteration (ms): 5726.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.997773E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:03.540922 | finish at 2025-09-10 11:59:03 + [2025-09-09 17:29:05] iteration 290/ 11920 | consumed samples: 296960 | elapsed time per iteration (ms): 5701.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.011593E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:11.350193 | finish at 2025-09-10 11:54:16 + [2025-09-09 17:29:11] iteration 291/ 11920 | consumed samples: 297984 | elapsed time per iteration (ms): 5711.7 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.020154E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:01.272916 | finish at 2025-09-10 11:56:12 + [2025-09-09 17:29:16] iteration 292/ 11920 | consumed samples: 299008 | elapsed time per iteration (ms): 5703.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.041184E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 18:25:15.028175 | finish at 2025-09-10 11:54:31 + [2025-09-09 17:29:22] iteration 293/ 11920 | consumed samples: 300032 | elapsed time per iteration (ms): 5675.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.010777E+00 | loss scale: 1.0 | grad norm: 0.334 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:52.469375 | finish at 2025-09-10 11:49:14 + [2025-09-09 17:29:28] iteration 294/ 11920 | consumed samples: 301056 | elapsed time per iteration (ms): 5720.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.001766E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:30.538241 | finish at 2025-09-10 11:57:58 + [2025-09-09 17:29:33] iteration 295/ 11920 | consumed samples: 302080 | elapsed time per iteration (ms): 5666.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.995046E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:53.193830 | finish at 2025-09-10 11:47:27 + [2025-09-09 17:29:39] iteration 296/ 11920 | consumed samples: 303104 | elapsed time per iteration (ms): 5725.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.980327E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:10.505606 | finish at 2025-09-10 11:58:50 + [2025-09-09 17:29:45] iteration 297/ 11920 | consumed samples: 304128 | elapsed time per iteration (ms): 5692.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 4.975224E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:22:39.905136 | finish at 2025-09-10 11:52:25 + [2025-09-09 17:29:50] iteration 298/ 11920 | consumed samples: 305152 | elapsed time per iteration (ms): 5714.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.979126E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:53.766024 | finish at 2025-09-10 11:56:44 + [2025-09-09 17:29:56] iteration 299/ 11920 | consumed samples: 306176 | elapsed time per iteration (ms): 5716.7 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.970828E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:13.511153 | finish at 2025-09-10 11:57:10 + [2025-09-09 17:30:02] iteration 300/ 11920 | consumed samples: 307200 | elapsed time per iteration (ms): 5722.2 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.948147E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:11.572404 | finish at 2025-09-10 11:58:14 + [2025-09-09 17:30:08] iteration 301/ 11920 | consumed samples: 308224 | elapsed time per iteration (ms): 5724.6 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.969265E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:33.884515 | finish at 2025-09-10 11:58:42 + [2025-09-09 17:30:13] 
iteration 302/ 11920 | consumed samples: 309248 | elapsed time per iteration (ms): 5731.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.958071E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:53.374587 | finish at 2025-09-10 12:00:07 + [2025-09-09 17:30:19] iteration 303/ 11920 | consumed samples: 310272 | elapsed time per iteration (ms): 5734.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.971785E+00 | loss scale: 1.0 | grad norm: 0.339 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:18.062383 | finish at 2025-09-10 12:00:37 + [2025-09-09 17:30:25] iteration 304/ 11920 | consumed samples: 311296 | elapsed time per iteration (ms): 5742.1 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.973248E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:40.419159 | finish at 2025-09-10 12:02:05 + [2025-09-09 17:30:31] iteration 305/ 11920 | consumed samples: 312320 | elapsed time per iteration (ms): 5766.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.989270E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:17.263302 | finish at 2025-09-10 12:06:48 + [2025-09-09 17:30:36] iteration 306/ 11920 | consumed samples: 313344 | elapsed time per iteration (ms): 5758.4 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.975654E+00 | loss scale: 1.0 | grad norm: 0.397 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:38.109786 | finish at 2025-09-10 12:05:14 + [2025-09-09 17:30:42] iteration 307/ 11920 | consumed samples: 314368 | elapsed time per iteration (ms): 5785.6 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.956201E+00 | loss scale: 1.0 | grad norm: 0.396 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:39:47.903615 | finish at 2025-09-10 12:10:30 + [2025-09-09 17:30:48] iteration 308/ 11920 | consumed samples: 315392 | elapsed time per iteration (ms): 5797.2 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.984341E+00 | loss scale: 1.0 | grad norm: 0.540 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:41:56.942025 | finish at 2025-09-10 12:12:45 + [2025-09-09 17:30:54] iteration 309/ 11920 | consumed samples: 316416 | elapsed time per iteration (ms): 5761.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.985085E+00 | loss scale: 1.0 | grad norm: 0.606 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:59.280419 | finish at 2025-09-10 12:05:53 + [2025-09-09 17:31:00] iteration 310/ 11920 | consumed samples: 317440 | elapsed time per iteration (ms): 5786.6 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.949619E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:39:42.737331 | finish at 2025-09-10 12:10:42 + [2025-09-09 17:31:05] iteration 311/ 11920 | consumed samples: 318464 | elapsed time per iteration (ms): 5786.0 | throughput per GPU 
(TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.989074E+00 | loss scale: 1.0 | grad norm: 0.827 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:39:29.699065 | finish at 2025-09-10 12:10:35 + [2025-09-09 17:31:11] iteration 312/ 11920 | consumed samples: 319488 | elapsed time per iteration (ms): 5767.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.944599E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:35:52.368860 | finish at 2025-09-10 12:07:03 + [2025-09-09 17:31:17] iteration 313/ 11920 | consumed samples: 320512 | elapsed time per iteration (ms): 5699.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.952800E+00 | loss scale: 1.0 | grad norm: 0.388 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:22:33.001614 | finish at 2025-09-10 11:53:50 + [2025-09-09 17:31:23] iteration 314/ 11920 | consumed samples: 321536 | elapsed time per iteration (ms): 5732.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.935058E+00 | loss scale: 1.0 | grad norm: 0.340 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:51.465063 | finish at 2025-09-10 12:00:14 + [2025-09-09 17:31:29] iteration 315/ 11920 | consumed samples: 322560 | elapsed time per iteration (ms): 6045.2 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.960669E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:29:14.840556 | 
finish at 2025-09-10 13:00:43 + [2025-09-09 17:31:34] iteration 316/ 11920 | consumed samples: 323584 | elapsed time per iteration (ms): 5730.0 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.949366E+00 | loss scale: 1.0 | grad norm: 0.347 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:10.668460 | finish at 2025-09-10 11:59:45 + [2025-09-09 17:31:40] iteration 317/ 11920 | consumed samples: 324608 | elapsed time per iteration (ms): 5707.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.920807E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:44.600861 | finish at 2025-09-10 11:55:25 + [2025-09-09 17:31:46] iteration 318/ 11920 | consumed samples: 325632 | elapsed time per iteration (ms): 5737.4 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.912874E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:25.605881 | finish at 2025-09-10 12:01:11 + [2025-09-09 17:31:51] iteration 319/ 11920 | consumed samples: 326656 | elapsed time per iteration (ms): 5713.3 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.891953E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:40.474446 | finish at 2025-09-10 11:56:32 + [2025-09-09 17:31:57] iteration 320/ 11920 | consumed samples: 327680 | elapsed time per iteration (ms): 5765.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
4.905820E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:43.061504 | finish at 2025-09-10 12:06:40 + [2025-09-09 17:32:03] iteration 321/ 11920 | consumed samples: 328704 | elapsed time per iteration (ms): 5721.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.904671E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:58.981893 | finish at 2025-09-10 11:58:02 + [2025-09-09 17:32:09] iteration 322/ 11920 | consumed samples: 329728 | elapsed time per iteration (ms): 5744.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.885851E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:24.588425 | finish at 2025-09-10 12:02:33 + [2025-09-09 17:32:14] iteration 323/ 11920 | consumed samples: 330752 | elapsed time per iteration (ms): 5683.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.881457E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:36.164783 | finish at 2025-09-10 11:50:51 + [2025-09-09 17:32:20] iteration 324/ 11920 | consumed samples: 331776 | elapsed time per iteration (ms): 5736.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.880802E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 2360323.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:34.695414 | finish at 2025-09-10 12:00:55 + [2025-09-09 17:32:26] iteration 325/ 11920 | 
consumed samples: 332800 | elapsed time per iteration (ms): 5709.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.880045E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:15.740175 | finish at 2025-09-10 11:55:42 + [2025-09-09 17:32:32] iteration 326/ 11920 | consumed samples: 333824 | elapsed time per iteration (ms): 5760.3 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.875611E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:04.632569 | finish at 2025-09-10 12:05:36 + [2025-09-09 17:32:37] iteration 327/ 11920 | consumed samples: 334848 | elapsed time per iteration (ms): 5706.4 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.893231E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:22:33.871355 | finish at 2025-09-10 11:55:11 + [2025-09-09 17:32:43] iteration 328/ 11920 | consumed samples: 335872 | elapsed time per iteration (ms): 5721.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.859555E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 2360322.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:27.604105 | finish at 2025-09-10 11:58:11 + [2025-09-09 17:32:49] iteration 329/ 11920 | consumed samples: 336896 | elapsed time per iteration (ms): 5723.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.858120E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 
2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:46.248127 | finish at 2025-09-10 11:58:35 + [2025-09-09 17:32:54] iteration 330/ 11920 | consumed samples: 337920 | elapsed time per iteration (ms): 5724.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.862231E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:41.773181 | finish at 2025-09-10 11:58:36 + [2025-09-09 17:33:00] iteration 331/ 11920 | consumed samples: 338944 | elapsed time per iteration (ms): 5725.4 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.862700E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:51.168444 | finish at 2025-09-10 11:58:51 + [2025-09-09 17:33:06] iteration 332/ 11920 | consumed samples: 339968 | elapsed time per iteration (ms): 5762.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.847669E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:59.453237 | finish at 2025-09-10 12:06:05 + [2025-09-09 17:33:12] iteration 333/ 11920 | consumed samples: 340992 | elapsed time per iteration (ms): 5726.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.860689E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:57.072106 | finish at 2025-09-10 11:59:09 + [2025-09-09 17:33:17] iteration 334/ 11920 | consumed samples: 342016 | elapsed time per iteration (ms): 
5749.3 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.839952E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:11.925725 | finish at 2025-09-10 12:03:29 + [2025-09-09 17:33:23] iteration 335/ 11920 | consumed samples: 343040 | elapsed time per iteration (ms): 5767.5 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.851792E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:37.050080 | finish at 2025-09-10 12:07:00 + [2025-09-09 17:33:29] iteration 336/ 11920 | consumed samples: 344064 | elapsed time per iteration (ms): 5789.0 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.855558E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:37:40.290100 | finish at 2025-09-10 12:11:09 + [2025-09-09 17:33:35] iteration 337/ 11920 | consumed samples: 345088 | elapsed time per iteration (ms): 5761.2 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.846094E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:11.976273 | finish at 2025-09-10 12:05:47 + [2025-09-09 17:33:40] iteration 338/ 11920 | consumed samples: 346112 | elapsed time per iteration (ms): 5747.1 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.834653E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 2360322.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| remaining time: 18:29:23.239371 | finish at 2025-09-10 12:03:04 + [2025-09-09 17:33:46] iteration 339/ 11920 | consumed samples: 347136 | elapsed time per iteration (ms): 5781.5 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.863203E+00 | loss scale: 1.0 | grad norm: 0.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:35:55.066713 | finish at 2025-09-10 12:09:41 + [2025-09-09 17:33:52] iteration 340/ 11920 | consumed samples: 348160 | elapsed time per iteration (ms): 5736.8 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.838677E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:12.317419 | finish at 2025-09-10 12:01:04 + [2025-09-09 17:33:58] iteration 341/ 11920 | consumed samples: 349184 | elapsed time per iteration (ms): 5731.2 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.825063E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:01.440337 | finish at 2025-09-10 11:59:59 + [2025-09-09 17:34:03] iteration 342/ 11920 | consumed samples: 350208 | elapsed time per iteration (ms): 5758.2 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.814360E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:08.333896 | finish at 2025-09-10 12:05:12 + [2025-09-09 17:34:09] iteration 343/ 11920 | consumed samples: 351232 | elapsed time per iteration (ms): 5745.6 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 4.803607E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:37.255415 | finish at 2025-09-10 12:02:46 + [2025-09-09 17:34:15] iteration 344/ 11920 | consumed samples: 352256 | elapsed time per iteration (ms): 5743.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.790952E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:10.225170 | finish at 2025-09-10 12:02:25 + [2025-09-09 17:34:21] iteration 345/ 11920 | consumed samples: 353280 | elapsed time per iteration (ms): 5764.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.811140E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:07.980304 | finish at 2025-09-10 12:06:29 + [2025-09-09 17:34:26] iteration 346/ 11920 | consumed samples: 354304 | elapsed time per iteration (ms): 5757.3 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.776624E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:35.035954 | finish at 2025-09-10 12:05:02 + [2025-09-09 17:34:32] iteration 347/ 11920 | consumed samples: 355328 | elapsed time per iteration (ms): 5754.2 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.789894E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:53.508145 | finish at 2025-09-10 12:04:26 + [2025-09-09 17:34:38] iteration 348/ 
11920 | consumed samples: 356352 | elapsed time per iteration (ms): 5745.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.771427E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:09.895678 | finish at 2025-09-10 12:02:48 + [2025-09-09 17:34:44] iteration 349/ 11920 | consumed samples: 357376 | elapsed time per iteration (ms): 5741.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.768489E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:14.865006 | finish at 2025-09-10 12:01:59 + [2025-09-09 17:34:49] iteration 350/ 11920 | consumed samples: 358400 | elapsed time per iteration (ms): 5748.2 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.773170E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:27.051218 | finish at 2025-09-10 12:03:17 + [2025-09-09 17:34:55] iteration 351/ 11920 | consumed samples: 359424 | elapsed time per iteration (ms): 5774.7 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.765376E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:27.754450 | finish at 2025-09-10 12:08:23 + [2025-09-09 17:35:01] iteration 352/ 11920 | consumed samples: 360448 | elapsed time per iteration (ms): 5764.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.767724E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 1.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:22.565472 | finish at 2025-09-10 12:06:24 + [2025-09-09 17:35:07] iteration 353/ 11920 | consumed samples: 361472 | elapsed time per iteration (ms): 5750.8 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.762756E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:39.723001 | finish at 2025-09-10 12:03:46 + [2025-09-09 17:35:13] iteration 354/ 11920 | consumed samples: 362496 | elapsed time per iteration (ms): 5803.2 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.753166E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:38:40.178029 | finish at 2025-09-10 12:13:53 + [2025-09-09 17:35:18] iteration 355/ 11920 | consumed samples: 363520 | elapsed time per iteration (ms): 5762.0 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.748876E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:37.638506 | finish at 2025-09-10 12:05:56 + [2025-09-09 17:35:24] iteration 356/ 11920 | consumed samples: 364544 | elapsed time per iteration (ms): 5763.1 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.739173E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:44.608657 | finish at 2025-09-10 12:06:09 + [2025-09-09 17:35:30] iteration 357/ 11920 | consumed samples: 365568 | elapsed time per iteration (ms): 5781.2 | throughput per GPU (TFLOP/s/GPU): 78.1 | 
MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.759406E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:07.449664 | finish at 2025-09-10 12:09:37 + [2025-09-09 17:35:36] iteration 358/ 11920 | consumed samples: 366592 | elapsed time per iteration (ms): 5735.9 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.756441E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:18.709246 | finish at 2025-09-10 12:00:54 + [2025-09-09 17:35:41] iteration 359/ 11920 | consumed samples: 367616 | elapsed time per iteration (ms): 5775.3 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.763182E+00 | loss scale: 1.0 | grad norm: 0.348 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:47.799826 | finish at 2025-09-10 12:08:29 + [2025-09-09 17:35:47] iteration 360/ 11920 | consumed samples: 368640 | elapsed time per iteration (ms): 5768.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.732139E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:22.405806 | finish at 2025-09-10 12:07:10 + [2025-09-09 17:35:53] iteration 361/ 11920 | consumed samples: 369664 | elapsed time per iteration (ms): 6221.8 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.735390E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:58:37.233295 | finish at 2025-09-10 13:34:31 + 
[2025-09-09 17:36:00] iteration 362/ 11920 | consumed samples: 370688 | elapsed time per iteration (ms): 6165.2 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.723553E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:47:37.152824 | finish at 2025-09-10 13:23:37 + [2025-09-09 17:36:05] iteration 363/ 11920 | consumed samples: 371712 | elapsed time per iteration (ms): 5799.0 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.747562E+00 | loss scale: 1.0 | grad norm: 0.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:59.553036 | finish at 2025-09-10 12:13:05 + [2025-09-09 17:36:11] iteration 364/ 11920 | consumed samples: 372736 | elapsed time per iteration (ms): 5796.5 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.753744E+00 | loss scale: 1.0 | grad norm: 0.369 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:23.896268 | finish at 2025-09-10 12:12:35 + [2025-09-09 17:36:17] iteration 365/ 11920 | consumed samples: 373760 | elapsed time per iteration (ms): 5783.6 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.716121E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:49.138167 | finish at 2025-09-10 12:10:06 + [2025-09-09 17:36:23] iteration 366/ 11920 | consumed samples: 374784 | elapsed time per iteration (ms): 5797.6 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.715438E+00 | loss scale: 1.0 | grad norm: 
0.265 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:25.299966 | finish at 2025-09-10 12:12:48 + [2025-09-09 17:36:29] iteration 367/ 11920 | consumed samples: 375808 | elapsed time per iteration (ms): 5790.1 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.712655E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:53.412051 | finish at 2025-09-10 12:11:22 + [2025-09-09 17:36:34] iteration 368/ 11920 | consumed samples: 376832 | elapsed time per iteration (ms): 5972.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.714610E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:09:53.748116 | finish at 2025-09-10 12:46:28 + [2025-09-09 17:36:41] iteration 369/ 11920 | consumed samples: 377856 | elapsed time per iteration (ms): 6030.8 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.711037E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:21:02.017107 | finish at 2025-09-10 12:57:43 + [2025-09-09 17:36:46] iteration 370/ 11920 | consumed samples: 378880 | elapsed time per iteration (ms): 5759.4 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.690717E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:40.534515 | finish at 2025-09-10 12:05:27 + [2025-09-09 17:36:52] iteration 371/ 11920 | consumed samples: 379904 | elapsed time per iteration (ms): 6008.0 | 
throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.689062E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:16:26.556020 | finish at 2025-09-10 12:53:19 + [2025-09-09 17:36:58] iteration 372/ 11920 | consumed samples: 380928 | elapsed time per iteration (ms): 5986.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.683535E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:12:16.210147 | finish at 2025-09-10 12:49:14 + [2025-09-09 17:37:04] iteration 373/ 11920 | consumed samples: 381952 | elapsed time per iteration (ms): 6223.2 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.680285E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:57:39.200506 | finish at 2025-09-10 13:34:44 + [2025-09-09 17:37:11] iteration 374/ 11920 | consumed samples: 382976 | elapsed time per iteration (ms): 6087.1 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.672601E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:31:21.577104 | finish at 2025-09-10 13:08:32 + [2025-09-09 17:37:17] iteration 375/ 11920 | consumed samples: 384000 | elapsed time per iteration (ms): 6001.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.698653E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
19:14:47.007960 | finish at 2025-09-10 12:52:04 + [2025-09-09 17:37:23] iteration 376/ 11920 | consumed samples: 385024 | elapsed time per iteration (ms): 6105.7 | throughput per GPU (TFLOP/s/GPU): 73.9 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.677727E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:34:44.531261 | finish at 2025-09-10 13:12:07 + [2025-09-09 17:37:29] iteration 377/ 11920 | consumed samples: 386048 | elapsed time per iteration (ms): 6353.9 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.18% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.686694E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:22:22.836016 | finish at 2025-09-10 13:59:52 + [2025-09-09 17:37:35] iteration 378/ 11920 | consumed samples: 387072 | elapsed time per iteration (ms): 6002.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.698783E+00 | loss scale: 1.0 | grad norm: 0.325 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:14:44.278934 | finish at 2025-09-10 12:52:19 + [2025-09-09 17:37:41] iteration 379/ 11920 | consumed samples: 388096 | elapsed time per iteration (ms): 5977.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.670756E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:09:48.049554 | finish at 2025-09-10 12:47:29 + [2025-09-09 17:37:47] iteration 380/ 11920 | consumed samples: 389120 | elapsed time per iteration (ms): 6044.9 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 4.651505E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:22:38.073778 | finish at 2025-09-10 13:00:25 + [2025-09-09 17:37:53] iteration 381/ 11920 | consumed samples: 390144 | elapsed time per iteration (ms): 6351.9 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.678489E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:21:34.423951 | finish at 2025-09-10 13:59:28 + [2025-09-09 17:37:59] iteration 382/ 11920 | consumed samples: 391168 | elapsed time per iteration (ms): 5746.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.653516E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:57.653941 | finish at 2025-09-10 12:02:57 + [2025-09-09 17:38:05] iteration 383/ 11920 | consumed samples: 392192 | elapsed time per iteration (ms): 5780.6 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.652985E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:30.609729 | finish at 2025-09-10 12:09:36 + [2025-09-09 17:38:11] iteration 384/ 11920 | consumed samples: 393216 | elapsed time per iteration (ms): 6114.2 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.47% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.666747E+00 | loss scale: 1.0 | grad norm: 0.329 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:35:33.497791 | finish at 2025-09-10 13:13:45 + [2025-09-09 17:38:17] iteration 385/ 11920 | consumed samples: 394240 
| elapsed time per iteration (ms): 6144.6 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.664915E+00 | loss scale: 1.0 | grad norm: 0.416 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:41:18.378038 | finish at 2025-09-10 13:19:36 + [2025-09-09 17:38:23] iteration 386/ 11920 | consumed samples: 395264 | elapsed time per iteration (ms): 5754.9 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.656960E+00 | loss scale: 1.0 | grad norm: 0.403 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:17.222598 | finish at 2025-09-10 12:04:40 + [2025-09-09 17:38:29] iteration 387/ 11920 | consumed samples: 396288 | elapsed time per iteration (ms): 5761.5 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.627171E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:27.534868 | finish at 2025-09-10 12:05:56 + [2025-09-09 17:38:34] iteration 388/ 11920 | consumed samples: 397312 | elapsed time per iteration (ms): 5761.1 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.633615E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:16.796863 | finish at 2025-09-10 12:05:51 + [2025-09-09 17:38:40] iteration 389/ 11920 | consumed samples: 398336 | elapsed time per iteration (ms): 5764.7 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.627338E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 1.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 18:27:52.345330 | finish at 2025-09-10 12:06:33 + [2025-09-09 17:38:46] iteration 390/ 11920 | consumed samples: 399360 | elapsed time per iteration (ms): 5755.2 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.645159E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:57.842557 | finish at 2025-09-10 12:04:44 + [2025-09-09 17:38:52] iteration 391/ 11920 | consumed samples: 400384 | elapsed time per iteration (ms): 5773.4 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.622124E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:21.752034 | finish at 2025-09-10 12:08:14 + [2025-09-09 17:38:58] iteration 392/ 11920 | consumed samples: 401408 | elapsed time per iteration (ms): 5772.7 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.621868E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:07.936535 | finish at 2025-09-10 12:08:05 + [2025-09-09 17:39:04] iteration 393/ 11920 | consumed samples: 402432 | elapsed time per iteration (ms): 6007.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.598460E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:14:11.985981 | finish at 2025-09-10 12:53:16 + [2025-09-09 17:39:09] iteration 394/ 11920 | consumed samples: 403456 | elapsed time per iteration (ms): 5789.5 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 4.614348E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:09.672554 | finish at 2025-09-10 12:11:19 + [2025-09-09 17:39:15] iteration 395/ 11920 | consumed samples: 404480 | elapsed time per iteration (ms): 6083.7 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.604483E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:28:34.174706 | finish at 2025-09-10 13:07:50 + [2025-09-09 17:39:21] iteration 396/ 11920 | consumed samples: 405504 | elapsed time per iteration (ms): 5794.3 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.594023E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:53.827333 | finish at 2025-09-10 12:12:15 + [2025-09-09 17:39:27] iteration 397/ 11920 | consumed samples: 406528 | elapsed time per iteration (ms): 5785.5 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.598106E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:06.421469 | finish at 2025-09-10 12:10:33 + [2025-09-09 17:39:33] iteration 398/ 11920 | consumed samples: 407552 | elapsed time per iteration (ms): 5760.9 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.588027E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:16.672484 | finish at 2025-09-10 12:05:49 + [2025-09-09 17:39:39] 
iteration 399/ 11920 | consumed samples: 408576 | elapsed time per iteration (ms): 5778.1 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.584644E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:29.781421 | finish at 2025-09-10 12:09:08 + [2025-09-09 17:39:44] iteration 400/ 11920 | consumed samples: 409600 | elapsed time per iteration (ms): 5766.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.591026E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:09.415283 | finish at 2025-09-10 12:06:54 + [2025-09-09 17:39:50] iteration 401/ 11920 | consumed samples: 410624 | elapsed time per iteration (ms): 5778.6 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.601693E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:23.907356 | finish at 2025-09-10 12:09:14 + [2025-09-09 17:39:56] iteration 402/ 11920 | consumed samples: 411648 | elapsed time per iteration (ms): 5764.1 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.605924E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:31.195748 | finish at 2025-09-10 12:06:27 + [2025-09-09 17:40:02] iteration 403/ 11920 | consumed samples: 412672 | elapsed time per iteration (ms): 5770.7 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.587543E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:40.918247 | finish at 2025-09-10 12:07:43 + [2025-09-09 17:40:07] iteration 404/ 11920 | consumed samples: 413696 | elapsed time per iteration (ms): 5777.3 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.598947E+00 | loss scale: 1.0 | grad norm: 0.362 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:51.487018 | finish at 2025-09-10 12:08:59 + [2025-09-09 17:40:13] iteration 405/ 11920 | consumed samples: 414720 | elapsed time per iteration (ms): 5757.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.578857E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:59.374272 | finish at 2025-09-10 12:05:13 + [2025-09-09 17:40:19] iteration 406/ 11920 | consumed samples: 415744 | elapsed time per iteration (ms): 5765.9 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.576154E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:28.192579 | finish at 2025-09-10 12:06:47 + [2025-09-09 17:40:25] iteration 407/ 11920 | consumed samples: 416768 | elapsed time per iteration (ms): 5793.0 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.558187E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:31:35.124467 | finish at 2025-09-10 12:12:00 + [2025-09-09 17:40:31] iteration 408/ 11920 | consumed samples: 417792 | elapsed time per iteration (ms): 5771.3 | throughput per GPU 
(TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.560146E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:19.132385 | finish at 2025-09-10 12:07:50 + [2025-09-09 17:40:36] iteration 409/ 11920 | consumed samples: 418816 | elapsed time per iteration (ms): 5782.3 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.569625E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:19.712193 | finish at 2025-09-10 12:09:56 + [2025-09-09 17:40:42] iteration 410/ 11920 | consumed samples: 419840 | elapsed time per iteration (ms): 5778.6 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.548127E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:31.636345 | finish at 2025-09-10 12:09:14 + [2025-09-09 17:40:48] iteration 411/ 11920 | consumed samples: 420864 | elapsed time per iteration (ms): 5764.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.558393E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:41.993981 | finish at 2025-09-10 12:06:30 + [2025-09-09 17:40:54] iteration 412/ 11920 | consumed samples: 421888 | elapsed time per iteration (ms): 5754.6 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.553625E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:44.176057 | finish at 
2025-09-10 12:04:38 + [2025-09-09 17:40:59] iteration 413/ 11920 | consumed samples: 422912 | elapsed time per iteration (ms): 5783.6 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.550239E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:12.459646 | finish at 2025-09-10 12:10:12 + [2025-09-09 17:41:05] iteration 414/ 11920 | consumed samples: 423936 | elapsed time per iteration (ms): 5780.9 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.539599E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:35.537431 | finish at 2025-09-10 12:09:41 + [2025-09-09 17:41:11] iteration 415/ 11920 | consumed samples: 424960 | elapsed time per iteration (ms): 5784.3 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.528686E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:07.812949 | finish at 2025-09-10 12:10:19 + [2025-09-09 17:41:17] iteration 416/ 11920 | consumed samples: 425984 | elapsed time per iteration (ms): 5775.3 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.542018E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:18.771736 | finish at 2025-09-10 12:08:35 + [2025-09-09 17:41:23] iteration 417/ 11920 | consumed samples: 427008 | elapsed time per iteration (ms): 5802.5 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.542270E+00 | loss 
scale: 1.0 | grad norm: 0.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:25.631044 | finish at 2025-09-10 12:13:48 + [2025-09-09 17:41:28] iteration 418/ 11920 | consumed samples: 428032 | elapsed time per iteration (ms): 5779.0 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.532280E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:50.135289 | finish at 2025-09-10 12:09:18 + [2025-09-09 17:41:34] iteration 419/ 11920 | consumed samples: 429056 | elapsed time per iteration (ms): 5784.1 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.525709E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:42.956677 | finish at 2025-09-10 12:10:17 + [2025-09-09 17:41:40] iteration 420/ 11920 | consumed samples: 430080 | elapsed time per iteration (ms): 5756.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.524934E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:21.732397 | finish at 2025-09-10 12:05:02 + [2025-09-09 17:41:46] iteration 421/ 11920 | consumed samples: 431104 | elapsed time per iteration (ms): 5791.0 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.522490E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:50.340357 | finish at 2025-09-10 12:11:36 + [2025-09-09 17:41:51] iteration 422/ 11920 | consumed samples: 432128 | elapsed time per 
iteration (ms): 5741.9 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.539494E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:20:20.618026 | finish at 2025-09-10 12:02:12 + [2025-09-09 17:41:57] iteration 423/ 11920 | consumed samples: 433152 | elapsed time per iteration (ms): 5740.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.513245E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:52.530670 | finish at 2025-09-10 12:01:50 + [2025-09-09 17:42:03] iteration 424/ 11920 | consumed samples: 434176 | elapsed time per iteration (ms): 5743.6 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.530481E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:20:28.468208 | finish at 2025-09-10 12:02:31 + [2025-09-09 17:42:09] iteration 425/ 11920 | consumed samples: 435200 | elapsed time per iteration (ms): 5757.5 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.512266E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:02.546692 | finish at 2025-09-10 12:05:11 + [2025-09-09 17:42:14] iteration 426/ 11920 | consumed samples: 436224 | elapsed time per iteration (ms): 5743.3 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.498839E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 18:20:13.355474 | finish at 2025-09-10 12:02:28 + [2025-09-09 17:42:20] iteration 427/ 11920 | consumed samples: 437248 | elapsed time per iteration (ms): 5740.3 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.482875E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:33.735776 | finish at 2025-09-10 12:01:54 + [2025-09-09 17:42:26] iteration 428/ 11920 | consumed samples: 438272 | elapsed time per iteration (ms): 5768.1 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.480015E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:46.495845 | finish at 2025-09-10 12:07:12 + [2025-09-09 17:42:32] iteration 429/ 11920 | consumed samples: 439296 | elapsed time per iteration (ms): 5771.9 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.481963E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:24.880277 | finish at 2025-09-10 12:07:57 + [2025-09-09 17:42:37] iteration 430/ 11920 | consumed samples: 440320 | elapsed time per iteration (ms): 5777.0 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.476845E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:18.247182 | finish at 2025-09-10 12:08:56 + [2025-09-09 17:42:43] iteration 431/ 11920 | consumed samples: 441344 | elapsed time per iteration (ms): 6013.9 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 4.480577E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:11:33.464746 | finish at 2025-09-10 12:54:17 + [2025-09-09 17:42:49] iteration 432/ 11920 | consumed samples: 442368 | elapsed time per iteration (ms): 5743.1 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.477592E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:36.981216 | finish at 2025-09-10 12:02:26 + [2025-09-09 17:42:55] iteration 433/ 11920 | consumed samples: 443392 | elapsed time per iteration (ms): 5766.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.506227E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:58.421576 | finish at 2025-09-10 12:06:53 + [2025-09-09 17:43:01] iteration 434/ 11920 | consumed samples: 444416 | elapsed time per iteration (ms): 6271.7 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.501261E+00 | loss scale: 1.0 | grad norm: 0.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:00:36.846437 | finish at 2025-09-10 13:43:38 + [2025-09-09 17:43:07] iteration 435/ 11920 | consumed samples: 445440 | elapsed time per iteration (ms): 6191.9 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.496781E+00 | loss scale: 1.0 | grad norm: 0.377 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:45:14.018221 | finish at 2025-09-10 13:28:21 + [2025-09-09 17:43:13] iteration 436/ 11920 | consumed 
samples: 446464 | elapsed time per iteration (ms): 5770.0 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.502164E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:22.490859 | finish at 2025-09-10 12:07:36 + [2025-09-09 17:43:19] iteration 437/ 11920 | consumed samples: 447488 | elapsed time per iteration (ms): 5776.9 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.489553E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:36.474578 | finish at 2025-09-10 12:08:55 + [2025-09-09 17:43:25] iteration 438/ 11920 | consumed samples: 448512 | elapsed time per iteration (ms): 5767.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.536650E+00 | loss scale: 1.0 | grad norm: 0.484 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:41.705943 | finish at 2025-09-10 12:07:06 + [2025-09-09 17:43:30] iteration 439/ 11920 | consumed samples: 449536 | elapsed time per iteration (ms): 5756.9 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.551234E+00 | loss scale: 1.0 | grad norm: 0.414 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:21:34.693262 | finish at 2025-09-10 12:05:05 + [2025-09-09 17:43:36] iteration 440/ 11920 | consumed samples: 450560 | elapsed time per iteration (ms): 5764.9 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.549653E+00 | loss scale: 1.0 | grad norm: 0.499 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:01.120071 | finish at 2025-09-10 12:06:37 + [2025-09-09 17:43:42] iteration 441/ 11920 | consumed samples: 451584 | elapsed time per iteration (ms): 5757.8 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.539360E+00 | loss scale: 1.0 | grad norm: 0.451 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:21:33.305696 | finish at 2025-09-10 12:05:15 + [2025-09-09 17:43:48] iteration 442/ 11920 | consumed samples: 452608 | elapsed time per iteration (ms): 5785.5 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.531888E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:45.561821 | finish at 2025-09-10 12:10:33 + [2025-09-09 17:43:54] iteration 443/ 11920 | consumed samples: 453632 | elapsed time per iteration (ms): 5744.6 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.527062E+00 | loss scale: 1.0 | grad norm: 0.334 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:50.287794 | finish at 2025-09-10 12:02:44 + [2025-09-09 17:43:59] iteration 444/ 11920 | consumed samples: 454656 | elapsed time per iteration (ms): 5739.7 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.530929E+00 | loss scale: 1.0 | grad norm: 0.440 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:48.951327 | finish at 2025-09-10 12:01:48 + [2025-09-09 17:44:05] iteration 445/ 11920 | consumed samples: 455680 | elapsed time per iteration (ms): 5738.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 
7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.524708E+00 | loss scale: 1.0 | grad norm: 0.380 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:29.718386 | finish at 2025-09-10 12:01:35 + [2025-09-09 17:44:11] iteration 446/ 11920 | consumed samples: 456704 | elapsed time per iteration (ms): 5753.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.506865E+00 | loss scale: 1.0 | grad norm: 0.283 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:20:14.583728 | finish at 2025-09-10 12:04:25 + [2025-09-09 17:44:16] iteration 447/ 11920 | consumed samples: 457728 | elapsed time per iteration (ms): 5741.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.548657E+00 | loss scale: 1.0 | grad norm: 0.322 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:45.934261 | finish at 2025-09-10 12:02:02 + [2025-09-09 17:44:22] iteration 448/ 11920 | consumed samples: 458752 | elapsed time per iteration (ms): 5738.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.497552E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:06.376064 | finish at 2025-09-10 12:01:29 + [2025-09-09 17:44:28] iteration 449/ 11920 | consumed samples: 459776 | elapsed time per iteration (ms): 5745.7 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.498536E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:29.087446 | finish at 2025-09-10 12:02:57 + 
[2025-09-09 17:44:34] iteration 450/ 11920 | consumed samples: 460800 | elapsed time per iteration (ms): 5720.4 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.496465E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:33.311779 | finish at 2025-09-10 11:58:07 + [2025-09-09 17:44:39] iteration 451/ 11920 | consumed samples: 461824 | elapsed time per iteration (ms): 5770.5 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.468564E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:02.142777 | finish at 2025-09-10 12:07:42 + [2025-09-09 17:44:45] iteration 452/ 11920 | consumed samples: 462848 | elapsed time per iteration (ms): 5750.1 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.463735E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:01.637065 | finish at 2025-09-10 12:03:47 + [2025-09-09 17:44:51] iteration 453/ 11920 | consumed samples: 463872 | elapsed time per iteration (ms): 5759.8 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.475563E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:20:47.377321 | finish at 2025-09-10 12:05:38 + [2025-09-09 17:44:57] iteration 454/ 11920 | consumed samples: 464896 | elapsed time per iteration (ms): 5726.2 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.464780E+00 | loss scale: 1.0 | grad norm: 
0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:16.654129 | finish at 2025-09-10 11:59:13 + [2025-09-09 17:45:02] iteration 455/ 11920 | consumed samples: 465920 | elapsed time per iteration (ms): 5738.6 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.471672E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:33.412731 | finish at 2025-09-10 12:01:36 + [2025-09-09 17:45:08] iteration 456/ 11920 | consumed samples: 466944 | elapsed time per iteration (ms): 5736.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.448908E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:02.804434 | finish at 2025-09-10 12:01:11 + [2025-09-09 17:45:14] iteration 457/ 11920 | consumed samples: 467968 | elapsed time per iteration (ms): 5735.2 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.447461E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:15:42.615909 | finish at 2025-09-10 12:00:57 + [2025-09-09 17:45:20] iteration 458/ 11920 | consumed samples: 468992 | elapsed time per iteration (ms): 5742.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.441551E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:54.783319 | finish at 2025-09-10 12:02:14 + [2025-09-09 17:45:25] iteration 459/ 11920 | consumed samples: 470016 | elapsed time per iteration (ms): 5737.8 | 
throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.424867E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:00.826087 | finish at 2025-09-10 12:01:26 + [2025-09-09 17:45:31] iteration 460/ 11920 | consumed samples: 471040 | elapsed time per iteration (ms): 5758.4 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.430031E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:51.167951 | finish at 2025-09-10 12:05:22 + [2025-09-09 17:45:37] iteration 461/ 11920 | consumed samples: 472064 | elapsed time per iteration (ms): 5731.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.430391E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:36.552607 | finish at 2025-09-10 12:00:13 + [2025-09-09 17:45:43] iteration 462/ 11920 | consumed samples: 473088 | elapsed time per iteration (ms): 5770.4 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.440589E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:21:57.295646 | finish at 2025-09-10 12:07:40 + [2025-09-09 17:45:48] iteration 463/ 11920 | consumed samples: 474112 | elapsed time per iteration (ms): 5746.3 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.434651E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
18:17:15.571956 | finish at 2025-09-10 12:03:04 + [2025-09-09 17:45:54] iteration 464/ 11920 | consumed samples: 475136 | elapsed time per iteration (ms): 5789.2 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.411799E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:21.409195 | finish at 2025-09-10 12:11:16 + [2025-09-09 17:46:00] iteration 465/ 11920 | consumed samples: 476160 | elapsed time per iteration (ms): 5746.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.414753E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:09.628884 | finish at 2025-09-10 12:03:10 + [2025-09-09 17:46:06] iteration 466/ 11920 | consumed samples: 477184 | elapsed time per iteration (ms): 5749.1 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.400288E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:30.185593 | finish at 2025-09-10 12:03:36 + [2025-09-09 17:46:11] iteration 467/ 11920 | consumed samples: 478208 | elapsed time per iteration (ms): 5747.3 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.383314E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:04.325566 | finish at 2025-09-10 12:03:16 + [2025-09-09 17:46:17] iteration 468/ 11920 | consumed samples: 479232 | elapsed time per iteration (ms): 5715.7 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 4.371886E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:56.394699 | finish at 2025-09-10 11:57:14 + [2025-09-09 17:46:23] iteration 469/ 11920 | consumed samples: 480256 | elapsed time per iteration (ms): 5744.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.377996E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:23.361843 | finish at 2025-09-10 12:02:46 + [2025-09-09 17:46:29] iteration 470/ 11920 | consumed samples: 481280 | elapsed time per iteration (ms): 5718.7 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.374503E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:11:19.086924 | finish at 2025-09-10 11:57:48 + [2025-09-09 17:46:34] iteration 471/ 11920 | consumed samples: 482304 | elapsed time per iteration (ms): 5751.6 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.370996E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:30.183356 | finish at 2025-09-10 12:04:05 + [2025-09-09 17:46:40] iteration 472/ 11920 | consumed samples: 483328 | elapsed time per iteration (ms): 5735.6 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.360290E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:21.399845 | finish at 2025-09-10 12:01:02 + [2025-09-09 17:46:46] iteration 473/ 11920 | consumed samples: 484352 
| elapsed time per iteration (ms): 5728.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.365047E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:54.116400 | finish at 2025-09-10 11:59:40 + [2025-09-09 17:46:52] iteration 474/ 11920 | consumed samples: 485376 | elapsed time per iteration (ms): 5995.3 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.351441E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:03:42.204424 | finish at 2025-09-10 12:50:34 + [2025-09-09 17:46:58] iteration 475/ 11920 | consumed samples: 486400 | elapsed time per iteration (ms): 5727.3 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.359597E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:28.421044 | finish at 2025-09-10 11:59:26 + [2025-09-09 17:47:03] iteration 476/ 11920 | consumed samples: 487424 | elapsed time per iteration (ms): 5765.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.355680E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:19:43.722424 | finish at 2025-09-10 12:06:47 + [2025-09-09 17:47:09] iteration 477/ 11920 | consumed samples: 488448 | elapsed time per iteration (ms): 5730.3 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.369287E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 2.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 18:12:51.855062 | finish at 2025-09-10 12:00:01 + [2025-09-09 17:47:15] iteration 478/ 11920 | consumed samples: 489472 | elapsed time per iteration (ms): 5739.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.355654E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:31.894209 | finish at 2025-09-10 12:01:47 + [2025-09-09 17:47:21] iteration 479/ 11920 | consumed samples: 490496 | elapsed time per iteration (ms): 5741.2 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.332106E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:44.629692 | finish at 2025-09-10 12:02:05 + [2025-09-09 17:47:26] iteration 480/ 11920 | consumed samples: 491520 | elapsed time per iteration (ms): 5719.2 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.355010E+00 | loss scale: 1.0 | grad norm: 0.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:27.374058 | finish at 2025-09-10 11:57:54 + [2025-09-09 17:47:32] iteration 481/ 11920 | consumed samples: 492544 | elapsed time per iteration (ms): 5714.6 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.350406E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:29.184932 | finish at 2025-09-10 11:57:01 + [2025-09-09 17:47:38] iteration 482/ 11920 | consumed samples: 493568 | elapsed time per iteration (ms): 5722.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 4.339314E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:57.844729 | finish at 2025-09-10 11:58:36 + [2025-09-09 17:47:44] iteration 483/ 11920 | consumed samples: 494592 | elapsed time per iteration (ms): 6047.7 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.335210E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:12:47.503178 | finish at 2025-09-10 13:00:31 + [2025-09-09 17:47:50] iteration 484/ 11920 | consumed samples: 495616 | elapsed time per iteration (ms): 6421.7 | throughput per GPU (TFLOP/s/GPU): 70.3 | MFU 7.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.321958E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:23:58.225842 | finish at 2025-09-10 14:11:48 + [2025-09-09 17:47:56] iteration 485/ 11920 | consumed samples: 496640 | elapsed time per iteration (ms): 5718.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.313938E+00 | loss scale: 1.0 | grad norm: 0.311 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:51.567070 | finish at 2025-09-10 11:57:47 + [2025-09-09 17:48:02] iteration 486/ 11920 | consumed samples: 497664 | elapsed time per iteration (ms): 5920.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.319102E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:48:19.201070 | finish at 2025-09-10 12:36:21 + [2025-09-09 17:48:08] 
iteration 487/ 11920 | consumed samples: 498688 | elapsed time per iteration (ms): 5763.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.305236E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:18:12.541488 | finish at 2025-09-10 12:06:20 + [2025-09-09 17:48:13] iteration 488/ 11920 | consumed samples: 499712 | elapsed time per iteration (ms): 5713.4 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.306082E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:35.960917 | finish at 2025-09-10 11:56:49 + [2025-09-09 17:48:19] iteration 489/ 11920 | consumed samples: 500736 | elapsed time per iteration (ms): 5727.1 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.299384E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:11:05.982889 | finish at 2025-09-10 11:59:25 + [2025-09-09 17:48:25] iteration 490/ 11920 | consumed samples: 501760 | elapsed time per iteration (ms): 5968.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.303957E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:56:59.204435 | finish at 2025-09-10 12:45:24 + [2025-09-09 17:48:31] iteration 491/ 11920 | consumed samples: 502784 | elapsed time per iteration (ms): 5735.8 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.267252E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:34.226903 | finish at 2025-09-10 12:01:05 + [2025-09-09 17:48:37] iteration 492/ 11920 | consumed samples: 503808 | elapsed time per iteration (ms): 5958.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.278102E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:54:55.401713 | finish at 2025-09-10 12:43:32 + [2025-09-09 17:48:42] iteration 493/ 11920 | consumed samples: 504832 | elapsed time per iteration (ms): 5723.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.286247E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:02.783377 | finish at 2025-09-10 11:58:45 + [2025-09-09 17:48:48] iteration 494/ 11920 | consumed samples: 505856 | elapsed time per iteration (ms): 5724.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.301480E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:07.684112 | finish at 2025-09-10 11:58:56 + [2025-09-09 17:48:54] iteration 495/ 11920 | consumed samples: 506880 | elapsed time per iteration (ms): 5740.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.283673E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:00.020380 | finish at 2025-09-10 12:01:54 + [2025-09-09 17:49:00] iteration 496/ 11920 | consumed samples: 507904 | elapsed time per iteration (ms): 5712.6 | throughput per GPU 
(TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.277297E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:40.957489 | finish at 2025-09-10 11:56:41 + [2025-09-09 17:49:05] iteration 497/ 11920 | consumed samples: 508928 | elapsed time per iteration (ms): 5745.4 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.271966E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:49.861611 | finish at 2025-09-10 12:02:55 + [2025-09-09 17:49:11] iteration 498/ 11920 | consumed samples: 509952 | elapsed time per iteration (ms): 6072.2 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.257545E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:15:57.050334 | finish at 2025-09-10 13:05:08 + [2025-09-09 17:49:17] iteration 499/ 11920 | consumed samples: 510976 | elapsed time per iteration (ms): 5948.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.254851E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:52:12.407005 | finish at 2025-09-10 12:41:30 + [2025-09-09 17:49:23] iteration 500/ 11920 | consumed samples: 512000 | elapsed time per iteration (ms): 5721.6 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.240034E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:00.302343 | finish at 
2025-09-10 11:58:23 + [2025-09-09 17:49:29] iteration 501/ 11920 | consumed samples: 513024 | elapsed time per iteration (ms): 5742.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.245496E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:53.962188 | finish at 2025-09-10 12:02:23 + [2025-09-09 17:49:35] iteration 502/ 11920 | consumed samples: 514048 | elapsed time per iteration (ms): 5707.6 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.248914E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:06:09.059628 | finish at 2025-09-10 11:55:44 + [2025-09-09 17:49:40] iteration 503/ 11920 | consumed samples: 515072 | elapsed time per iteration (ms): 5758.8 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.244141E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:15:48.007622 | finish at 2025-09-10 12:05:28 + [2025-09-09 17:49:46] iteration 504/ 11920 | consumed samples: 516096 | elapsed time per iteration (ms): 5716.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.230276E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:39.782244 | finish at 2025-09-10 11:57:26 + [2025-09-09 17:49:52] iteration 505/ 11920 | consumed samples: 517120 | elapsed time per iteration (ms): 5727.6 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.218524E+00 | loss 
scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:41.042272 | finish at 2025-09-10 11:59:33 + [2025-09-09 17:49:57] iteration 506/ 11920 | consumed samples: 518144 | elapsed time per iteration (ms): 5715.8 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.215280E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:20.451965 | finish at 2025-09-10 11:57:18 + [2025-09-09 17:50:03] iteration 507/ 11920 | consumed samples: 519168 | elapsed time per iteration (ms): 5748.3 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.221540E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:25.492979 | finish at 2025-09-10 12:03:29 + [2025-09-09 17:50:09] iteration 508/ 11920 | consumed samples: 520192 | elapsed time per iteration (ms): 5740.1 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.202379E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:11:46.090879 | finish at 2025-09-10 12:01:55 + [2025-09-09 17:50:15] iteration 509/ 11920 | consumed samples: 521216 | elapsed time per iteration (ms): 5757.0 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.223395E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:14:53.061357 | finish at 2025-09-10 12:05:08 + [2025-09-09 17:50:21] iteration 510/ 11920 | consumed samples: 522240 | elapsed time per 
iteration (ms): 5998.3 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.200730E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:00:40.960228 | finish at 2025-09-10 12:51:02 + [2025-09-09 17:50:26] iteration 511/ 11920 | consumed samples: 523264 | elapsed time per iteration (ms): 5729.2 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.227373E+00 | loss scale: 1.0 | grad norm: 0.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:09:24.338139 | finish at 2025-09-10 11:59:51 + [2025-09-09 17:50:32] iteration 512/ 11920 | consumed samples: 524288 | elapsed time per iteration (ms): 5710.9 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.214289E+00 | loss scale: 1.0 | grad norm: 0.343 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:50.312443 | finish at 2025-09-10 11:56:22 + [2025-09-09 17:50:38] iteration 513/ 11920 | consumed samples: 525312 | elapsed time per iteration (ms): 5937.7 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.211411E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:48:50.841145 | finish at 2025-09-10 12:39:29 + [2025-09-09 17:50:44] iteration 514/ 11920 | consumed samples: 526336 | elapsed time per iteration (ms): 5728.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.200197E+00 | loss scale: 1.0 | grad norm: 0.283 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 18:09:03.359719 | finish at 2025-09-10 11:59:47 + [2025-09-09 17:50:50] iteration 515/ 11920 | consumed samples: 527360 | elapsed time per iteration (ms): 5743.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.210006E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:11:38.986046 | finish at 2025-09-10 12:02:29 + [2025-09-09 17:50:55] iteration 516/ 11920 | consumed samples: 528384 | elapsed time per iteration (ms): 5705.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.208084E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:29.011817 | finish at 2025-09-10 11:55:24 + [2025-09-09 17:51:01] iteration 517/ 11920 | consumed samples: 529408 | elapsed time per iteration (ms): 5725.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.193108E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:05.444493 | finish at 2025-09-10 11:59:06 + [2025-09-09 17:51:07] iteration 518/ 11920 | consumed samples: 530432 | elapsed time per iteration (ms): 5933.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.200868E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:47:32.508945 | finish at 2025-09-10 12:38:39 + [2025-09-09 17:51:13] iteration 519/ 11920 | consumed samples: 531456 | elapsed time per iteration (ms): 5721.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 4.207294E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:08.917840 | finish at 2025-09-10 11:58:22 + [2025-09-09 17:51:18] iteration 520/ 11920 | consumed samples: 532480 | elapsed time per iteration (ms): 5706.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.205308E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:17.799768 | finish at 2025-09-10 11:55:36 + [2025-09-09 17:51:24] iteration 521/ 11920 | consumed samples: 533504 | elapsed time per iteration (ms): 5726.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.181069E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:55.398225 | finish at 2025-09-10 11:59:19 + [2025-09-09 17:51:30] iteration 522/ 11920 | consumed samples: 534528 | elapsed time per iteration (ms): 5713.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.186460E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:17.416003 | finish at 2025-09-10 11:56:47 + [2025-09-09 17:51:36] iteration 523/ 11920 | consumed samples: 535552 | elapsed time per iteration (ms): 5961.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.182032E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:52:19.570855 | finish at 2025-09-10 12:43:55 + [2025-09-09 17:51:41] iteration 524/ 11920 | consumed 
samples: 536576 | elapsed time per iteration (ms): 5706.9 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.186431E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:03:56.181543 | finish at 2025-09-10 11:55:38 + [2025-09-09 17:51:47] iteration 525/ 11920 | consumed samples: 537600 | elapsed time per iteration (ms): 5711.6 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.179875E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:43.514303 | finish at 2025-09-10 11:56:31 + [2025-09-09 17:51:53] iteration 526/ 11920 | consumed samples: 538624 | elapsed time per iteration (ms): 5699.3 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.163294E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:17.765014 | finish at 2025-09-10 11:54:11 + [2025-09-09 17:51:59] iteration 527/ 11920 | consumed samples: 539648 | elapsed time per iteration (ms): 5729.0 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.150837E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:50.158505 | finish at 2025-09-10 11:59:49 + [2025-09-09 17:52:04] iteration 528/ 11920 | consumed samples: 540672 | elapsed time per iteration (ms): 5699.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.158041E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:07.260010 | finish at 2025-09-10 11:54:12 + [2025-09-09 17:52:10] iteration 529/ 11920 | consumed samples: 541696 | elapsed time per iteration (ms): 5729.1 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.155537E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:39.786895 | finish at 2025-09-10 11:59:50 + [2025-09-09 17:52:16] iteration 530/ 11920 | consumed samples: 542720 | elapsed time per iteration (ms): 5931.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.125437E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:46:03.241465 | finish at 2025-09-10 12:38:19 + [2025-09-09 17:52:22] iteration 531/ 11920 | consumed samples: 543744 | elapsed time per iteration (ms): 6061.9 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.114467E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:10:39.176187 | finish at 2025-09-10 13:03:01 + [2025-09-09 17:52:28] iteration 532/ 11920 | consumed samples: 544768 | elapsed time per iteration (ms): 5703.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.107660E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:33.250342 | finish at 2025-09-10 11:55:01 + [2025-09-09 17:52:33] iteration 533/ 11920 | consumed samples: 545792 | elapsed time per iteration (ms): 5722.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 
7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.115177E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:57.411748 | finish at 2025-09-10 11:58:31 + [2025-09-09 17:52:39] iteration 534/ 11920 | consumed samples: 546816 | elapsed time per iteration (ms): 5697.2 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.116269E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:08.501762 | finish at 2025-09-10 11:53:48 + [2025-09-09 17:52:45] iteration 535/ 11920 | consumed samples: 547840 | elapsed time per iteration (ms): 5929.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.112363E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:45:01.435862 | finish at 2025-09-10 12:37:46 + [2025-09-09 17:52:51] iteration 536/ 11920 | consumed samples: 548864 | elapsed time per iteration (ms): 6243.5 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.120419E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:44:36.139643 | finish at 2025-09-10 13:37:27 + [2025-09-09 17:52:57] iteration 537/ 11920 | consumed samples: 549888 | elapsed time per iteration (ms): 5986.7 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.111046E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:55:46.047693 | finish at 2025-09-10 12:48:43 + 
[2025-09-09 17:53:03] iteration 538/ 11920 | consumed samples: 550912 | elapsed time per iteration (ms): 5934.5 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.101763E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:45:46.755557 | finish at 2025-09-10 12:38:50 + [2025-09-09 17:53:09] iteration 539/ 11920 | consumed samples: 551936 | elapsed time per iteration (ms): 6065.1 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.082115E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:10:26.780478 | finish at 2025-09-10 13:03:36 + [2025-09-09 17:53:16] iteration 540/ 11920 | consumed samples: 552960 | elapsed time per iteration (ms): 6289.1 | throughput per GPU (TFLOP/s/GPU): 71.8 | MFU 7.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.098420E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:52:49.593654 | finish at 2025-09-10 13:46:05 + [2025-09-09 17:53:21] iteration 541/ 11920 | consumed samples: 553984 | elapsed time per iteration (ms): 5710.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.079906E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:03:03.680765 | finish at 2025-09-10 11:56:25 + [2025-09-09 17:53:27] iteration 542/ 11920 | consumed samples: 555008 | elapsed time per iteration (ms): 5977.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.101649E+00 | loss scale: 1.0 | grad norm: 
0.249 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:53:30.312675 | finish at 2025-09-10 12:46:58 + [2025-09-09 17:53:33] iteration 543/ 11920 | consumed samples: 556032 | elapsed time per iteration (ms): 6090.9 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.102075E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:14:56.396515 | finish at 2025-09-10 13:08:30 + [2025-09-09 17:53:39] iteration 544/ 11920 | consumed samples: 557056 | elapsed time per iteration (ms): 5960.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.095980E+00 | loss scale: 1.0 | grad norm: 0.314 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:50:06.775497 | finish at 2025-09-10 12:43:46 + [2025-09-09 17:53:45] iteration 545/ 11920 | consumed samples: 558080 | elapsed time per iteration (ms): 5906.5 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.068367E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:39:46.072528 | finish at 2025-09-10 12:33:31 + [2025-09-09 17:53:51] iteration 546/ 11920 | consumed samples: 559104 | elapsed time per iteration (ms): 5908.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.070885E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:40:05.594355 | finish at 2025-09-10 12:33:57 + [2025-09-09 17:53:57] iteration 547/ 11920 | consumed samples: 560128 | elapsed time per iteration (ms): 6293.3 | 
throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.086028E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:52:53.550781 | finish at 2025-09-10 13:46:51 + [2025-09-09 17:54:03] iteration 548/ 11920 | consumed samples: 561152 | elapsed time per iteration (ms): 5704.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.064333E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:11.535586 | finish at 2025-09-10 11:55:15 + [2025-09-09 17:54:09] iteration 549/ 11920 | consumed samples: 562176 | elapsed time per iteration (ms): 5723.6 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.069638E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:43.420578 | finish at 2025-09-10 11:58:52 + [2025-09-09 17:54:15] iteration 550/ 11920 | consumed samples: 563200 | elapsed time per iteration (ms): 5697.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.069485E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:35.820115 | finish at 2025-09-10 11:53:50 + [2025-09-09 17:54:20] iteration 551/ 11920 | consumed samples: 564224 | elapsed time per iteration (ms): 5706.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.053427E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
18:01:21.026447 | finish at 2025-09-10 11:55:41 + [2025-09-09 17:54:26] iteration 552/ 11920 | consumed samples: 565248 | elapsed time per iteration (ms): 5722.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.071034E+00 | loss scale: 1.0 | grad norm: 0.375 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:10.592033 | finish at 2025-09-10 11:58:37 + [2025-09-09 17:54:32] iteration 553/ 11920 | consumed samples: 566272 | elapsed time per iteration (ms): 5697.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.058414E+00 | loss scale: 1.0 | grad norm: 0.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:26.078671 | finish at 2025-09-10 11:53:58 + [2025-09-09 17:54:37] iteration 554/ 11920 | consumed samples: 567296 | elapsed time per iteration (ms): 5712.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.051692E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:05.007986 | finish at 2025-09-10 11:56:42 + [2025-09-09 17:54:43] iteration 555/ 11920 | consumed samples: 568320 | elapsed time per iteration (ms): 5693.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.048440E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:24.018606 | finish at 2025-09-10 11:53:07 + [2025-09-09 17:54:49] iteration 556/ 11920 | consumed samples: 569344 | elapsed time per iteration (ms): 5716.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 4.041344E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:39.152770 | finish at 2025-09-10 11:57:28 + [2025-09-09 17:54:54] iteration 557/ 11920 | consumed samples: 570368 | elapsed time per iteration (ms): 5684.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.040130E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:31.317971 | finish at 2025-09-10 11:51:26 + [2025-09-09 17:55:00] iteration 558/ 11920 | consumed samples: 571392 | elapsed time per iteration (ms): 5714.4 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.047304E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:06.512254 | finish at 2025-09-10 11:57:07 + [2025-09-09 17:55:06] iteration 559/ 11920 | consumed samples: 572416 | elapsed time per iteration (ms): 5695.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.040639E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:25.932374 | finish at 2025-09-10 11:53:32 + [2025-09-09 17:55:12] iteration 560/ 11920 | consumed samples: 573440 | elapsed time per iteration (ms): 5702.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.030368E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:41.969376 | finish at 2025-09-10 11:54:54 + [2025-09-09 17:55:17] iteration 561/ 11920 | consumed samples: 574464 
| elapsed time per iteration (ms): 5671.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.039036E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:43.957437 | finish at 2025-09-10 11:49:01 + [2025-09-09 17:55:23] iteration 562/ 11920 | consumed samples: 575488 | elapsed time per iteration (ms): 5702.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.017909E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:31.338575 | finish at 2025-09-10 11:54:54 + [2025-09-09 17:55:29] iteration 563/ 11920 | consumed samples: 576512 | elapsed time per iteration (ms): 5674.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.014958E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:01.987536 | finish at 2025-09-10 11:49:31 + [2025-09-09 17:55:34] iteration 564/ 11920 | consumed samples: 577536 | elapsed time per iteration (ms): 5714.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.035031E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:28.963603 | finish at 2025-09-10 11:57:03 + [2025-09-09 17:55:40] iteration 565/ 11920 | consumed samples: 578560 | elapsed time per iteration (ms): 5697.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.019670E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 17:58:15.247754 | finish at 2025-09-10 11:53:55 + [2025-09-09 17:55:46] iteration 566/ 11920 | consumed samples: 579584 | elapsed time per iteration (ms): 5706.4 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.007046E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:51.006065 | finish at 2025-09-10 11:55:37 + [2025-09-09 17:55:51] iteration 567/ 11920 | consumed samples: 580608 | elapsed time per iteration (ms): 5683.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.998594E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:20.456086 | finish at 2025-09-10 11:51:12 + [2025-09-09 17:55:57] iteration 568/ 11920 | consumed samples: 581632 | elapsed time per iteration (ms): 5715.6 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.002146E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:24.032661 | finish at 2025-09-10 11:57:21 + [2025-09-09 17:56:03] iteration 569/ 11920 | consumed samples: 582656 | elapsed time per iteration (ms): 5690.8 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.989117E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:35.793943 | finish at 2025-09-10 11:52:39 + [2025-09-09 17:56:09] iteration 570/ 11920 | consumed samples: 583680 | elapsed time per iteration (ms): 5681.2 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.994220E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:41.907153 | finish at 2025-09-10 11:50:50 + [2025-09-09 17:56:14] iteration 571/ 11920 | consumed samples: 584704 | elapsed time per iteration (ms): 5674.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.995911E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:15.714478 | finish at 2025-09-10 11:49:30 + [2025-09-09 17:56:20] iteration 572/ 11920 | consumed samples: 585728 | elapsed time per iteration (ms): 5697.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.994302E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:30.032484 | finish at 2025-09-10 11:53:50 + [2025-09-09 17:56:26] iteration 573/ 11920 | consumed samples: 586752 | elapsed time per iteration (ms): 5661.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.999343E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:40.066319 | finish at 2025-09-10 11:47:06 + [2025-09-09 17:56:31] iteration 574/ 11920 | consumed samples: 587776 | elapsed time per iteration (ms): 5692.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.996166E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:27.501243 | finish at 2025-09-10 11:52:59 + [2025-09-09 17:56:37] 
iteration 575/ 11920 | consumed samples: 588800 | elapsed time per iteration (ms): 5687.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.981399E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:24.490045 | finish at 2025-09-10 11:52:01 + [2025-09-09 17:56:43] iteration 576/ 11920 | consumed samples: 589824 | elapsed time per iteration (ms): 5694.6 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.984982E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:39.072990 | finish at 2025-09-10 11:53:22 + [2025-09-09 17:56:48] iteration 577/ 11920 | consumed samples: 590848 | elapsed time per iteration (ms): 5686.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.986764E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:00.582974 | finish at 2025-09-10 11:51:49 + [2025-09-09 17:56:54] iteration 578/ 11920 | consumed samples: 591872 | elapsed time per iteration (ms): 5687.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.965045E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:03.768891 | finish at 2025-09-10 11:51:58 + [2025-09-09 17:57:00] iteration 579/ 11920 | consumed samples: 592896 | elapsed time per iteration (ms): 5694.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.964587E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:21.134881 | finish at 2025-09-10 11:53:21 + [2025-09-09 17:57:05] iteration 580/ 11920 | consumed samples: 593920 | elapsed time per iteration (ms): 5664.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.970851E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:39.807215 | finish at 2025-09-10 11:47:45 + [2025-09-09 17:57:11] iteration 581/ 11920 | consumed samples: 594944 | elapsed time per iteration (ms): 5674.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.965899E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:24.547636 | finish at 2025-09-10 11:49:36 + [2025-09-09 17:57:17] iteration 582/ 11920 | consumed samples: 595968 | elapsed time per iteration (ms): 5675.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.958873E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:23.776599 | finish at 2025-09-10 11:49:40 + [2025-09-09 17:57:22] iteration 583/ 11920 | consumed samples: 596992 | elapsed time per iteration (ms): 5674.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.969452E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:08.524987 | finish at 2025-09-10 11:49:31 + [2025-09-09 17:57:28] iteration 584/ 11920 | consumed samples: 598016 | elapsed time per iteration (ms): 5685.2 | throughput per GPU 
(TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.988975E+00 | loss scale: 1.0 | grad norm: 0.347 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:07.956659 | finish at 2025-09-10 11:51:36 + [2025-09-09 17:57:34] iteration 585/ 11920 | consumed samples: 599040 | elapsed time per iteration (ms): 5711.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.973320E+00 | loss scale: 1.0 | grad norm: 0.333 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:54.081917 | finish at 2025-09-10 11:56:28 + [2025-09-09 17:57:39] iteration 586/ 11920 | consumed samples: 600064 | elapsed time per iteration (ms): 5688.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.973881E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:35.706439 | finish at 2025-09-10 11:52:15 + [2025-09-09 17:57:45] iteration 587/ 11920 | consumed samples: 601088 | elapsed time per iteration (ms): 5704.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.974673E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:32.905161 | finish at 2025-09-10 11:55:18 + [2025-09-09 17:57:51] iteration 588/ 11920 | consumed samples: 602112 | elapsed time per iteration (ms): 5677.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.983175E+00 | loss scale: 1.0 | grad norm: 0.366 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:20.453377 | finish at 
2025-09-10 11:50:11 + [2025-09-09 17:57:57] iteration 589/ 11920 | consumed samples: 603136 | elapsed time per iteration (ms): 5683.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.979072E+00 | loss scale: 1.0 | grad norm: 0.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:14.411684 | finish at 2025-09-10 11:51:11 + [2025-09-09 17:58:02] iteration 590/ 11920 | consumed samples: 604160 | elapsed time per iteration (ms): 5674.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.974464E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:27.889779 | finish at 2025-09-10 11:49:30 + [2025-09-09 17:58:08] iteration 591/ 11920 | consumed samples: 605184 | elapsed time per iteration (ms): 5678.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.957129E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:15.355990 | finish at 2025-09-10 11:50:23 + [2025-09-09 17:58:14] iteration 592/ 11920 | consumed samples: 606208 | elapsed time per iteration (ms): 5679.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.947958E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:18.616837 | finish at 2025-09-10 11:50:32 + [2025-09-09 17:58:19] iteration 593/ 11920 | consumed samples: 607232 | elapsed time per iteration (ms): 5692.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.949463E+00 | loss 
scale: 1.0 | grad norm: 0.266 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:36.880162 | finish at 2025-09-10 11:52:56 + [2025-09-09 17:58:25] iteration 594/ 11920 | consumed samples: 608256 | elapsed time per iteration (ms): 5667.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.948403E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:50.161923 | finish at 2025-09-10 11:48:15 + [2025-09-09 17:58:31] iteration 595/ 11920 | consumed samples: 609280 | elapsed time per iteration (ms): 5678.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.923505E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:49.775913 | finish at 2025-09-10 11:50:20 + [2025-09-09 17:58:36] iteration 596/ 11920 | consumed samples: 610304 | elapsed time per iteration (ms): 5682.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.934001E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:26.368928 | finish at 2025-09-10 11:51:03 + [2025-09-09 17:58:42] iteration 597/ 11920 | consumed samples: 611328 | elapsed time per iteration (ms): 5692.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.917682E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:17.893050 | finish at 2025-09-10 11:53:00 + [2025-09-09 17:58:48] iteration 598/ 11920 | consumed samples: 612352 | elapsed time per 
iteration (ms): 5667.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.928430E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:30.901214 | finish at 2025-09-10 11:48:19 + [2025-09-09 17:58:54] iteration 599/ 11920 | consumed samples: 613376 | elapsed time per iteration (ms): 5911.4 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.914660E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:35:23.089861 | finish at 2025-09-10 12:34:17 + [2025-09-09 17:58:59] iteration 600/ 11920 | consumed samples: 614400 | elapsed time per iteration (ms): 5673.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.918799E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:18.077717 | finish at 2025-09-10 11:49:17 + [2025-09-09 17:59:05] iteration 601/ 11920 | consumed samples: 615424 | elapsed time per iteration (ms): 5693.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.926535E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:02.308976 | finish at 2025-09-10 11:53:07 + [2025-09-09 17:59:11] iteration 602/ 11920 | consumed samples: 616448 | elapsed time per iteration (ms): 5676.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.896896E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:50:45.791418 | finish at 2025-09-10 11:49:56 + [2025-09-09 17:59:17] iteration 603/ 11920 | consumed samples: 617472 | elapsed time per iteration (ms): 5904.2 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.897099E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:33:37.816083 | finish at 2025-09-10 12:32:54 + [2025-09-09 17:59:22] iteration 604/ 11920 | consumed samples: 618496 | elapsed time per iteration (ms): 5688.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.906105E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:48.056972 | finish at 2025-09-10 11:52:10 + [2025-09-09 17:59:28] iteration 605/ 11920 | consumed samples: 619520 | elapsed time per iteration (ms): 5682.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.883279E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:38.489752 | finish at 2025-09-10 11:51:06 + [2025-09-09 17:59:34] iteration 606/ 11920 | consumed samples: 620544 | elapsed time per iteration (ms): 5672.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.878004E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:37.727792 | finish at 2025-09-10 11:49:11 + [2025-09-09 17:59:39] iteration 607/ 11920 | consumed samples: 621568 | elapsed time per iteration (ms): 5697.2 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 3.883555E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:12.653542 | finish at 2025-09-10 11:53:52 + [2025-09-09 17:59:45] iteration 608/ 11920 | consumed samples: 622592 | elapsed time per iteration (ms): 5698.8 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.874987E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:25.090889 | finish at 2025-09-10 11:54:10 + [2025-09-09 17:59:51] iteration 609/ 11920 | consumed samples: 623616 | elapsed time per iteration (ms): 5696.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.891714E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:55.337033 | finish at 2025-09-10 11:53:46 + [2025-09-09 17:59:56] iteration 610/ 11920 | consumed samples: 624640 | elapsed time per iteration (ms): 5677.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.881866E+00 | loss scale: 1.0 | grad norm: 0.328 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:15.477791 | finish at 2025-09-10 11:50:12 + [2025-09-09 18:00:02] iteration 611/ 11920 | consumed samples: 625664 | elapsed time per iteration (ms): 5694.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.904542E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:19.558522 | finish at 2025-09-10 11:53:22 + [2025-09-09 18:00:08] iteration 612/ 11920 | consumed 
samples: 626688 | elapsed time per iteration (ms): 5664.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.892527E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:38.431109 | finish at 2025-09-10 11:47:46 + [2025-09-09 18:00:13] iteration 613/ 11920 | consumed samples: 627712 | elapsed time per iteration (ms): 5698.3 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.896730E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:50.590532 | finish at 2025-09-10 11:54:04 + [2025-09-09 18:00:19] iteration 614/ 11920 | consumed samples: 628736 | elapsed time per iteration (ms): 5669.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.883706E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:24.322712 | finish at 2025-09-10 11:48:43 + [2025-09-09 18:00:25] iteration 615/ 11920 | consumed samples: 629760 | elapsed time per iteration (ms): 5684.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.892637E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:02.460971 | finish at 2025-09-10 11:51:27 + [2025-09-09 18:00:30] iteration 616/ 11920 | consumed samples: 630784 | elapsed time per iteration (ms): 5685.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.887823E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:03.759504 | finish at 2025-09-10 11:51:34 + [2025-09-09 18:00:36] iteration 617/ 11920 | consumed samples: 631808 | elapsed time per iteration (ms): 5677.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.892385E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:30.597088 | finish at 2025-09-10 11:50:07 + [2025-09-09 18:00:42] iteration 618/ 11920 | consumed samples: 632832 | elapsed time per iteration (ms): 5698.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.903566E+00 | loss scale: 1.0 | grad norm: 0.310 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:26.448166 | finish at 2025-09-10 11:54:08 + [2025-09-09 18:00:47] iteration 619/ 11920 | consumed samples: 633856 | elapsed time per iteration (ms): 5673.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.890499E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:30.684568 | finish at 2025-09-10 11:49:18 + [2025-09-09 18:00:53] iteration 620/ 11920 | consumed samples: 634880 | elapsed time per iteration (ms): 5679.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.878165E+00 | loss scale: 1.0 | grad norm: 0.311 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:39.057026 | finish at 2025-09-10 11:50:32 + [2025-09-09 18:00:59] iteration 621/ 11920 | consumed samples: 635904 | elapsed time per iteration (ms): 5696.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 
8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.917312E+00 | loss scale: 1.0 | grad norm: 0.348 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:45.061315 | finish at 2025-09-10 11:53:44 + [2025-09-09 18:01:05] iteration 622/ 11920 | consumed samples: 636928 | elapsed time per iteration (ms): 5928.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.909210E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:36:22.716242 | finish at 2025-09-10 12:37:27 + [2025-09-09 18:01:10] iteration 623/ 11920 | consumed samples: 637952 | elapsed time per iteration (ms): 5675.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.909174E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:33.711946 | finish at 2025-09-10 11:49:44 + [2025-09-09 18:01:16] iteration 624/ 11920 | consumed samples: 638976 | elapsed time per iteration (ms): 5665.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.909969E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:40.312302 | finish at 2025-09-10 11:47:56 + [2025-09-09 18:01:22] iteration 625/ 11920 | consumed samples: 640000 | elapsed time per iteration (ms): 5870.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.904805E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:04.996029 | finish at 2025-09-10 12:26:27 + 
[2025-09-09 18:01:28] iteration 626/ 11920 | consumed samples: 641024 | elapsed time per iteration (ms): 5677.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.906238E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:39.582109 | finish at 2025-09-10 11:50:07 + [2025-09-09 18:01:33] iteration 627/ 11920 | consumed samples: 642048 | elapsed time per iteration (ms): 5682.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.877144E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:31.598850 | finish at 2025-09-10 11:51:05 + [2025-09-09 18:01:40] iteration 628/ 11920 | consumed samples: 643072 | elapsed time per iteration (ms): 6221.1 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.880281E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:30:48.600317 | finish at 2025-09-10 13:32:28 + [2025-09-09 18:01:45] iteration 629/ 11920 | consumed samples: 644096 | elapsed time per iteration (ms): 5898.8 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.880610E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:30:03.365780 | finish at 2025-09-10 12:31:49 + [2025-09-09 18:01:51] iteration 630/ 11920 | consumed samples: 645120 | elapsed time per iteration (ms): 5889.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.869336E+00 | loss scale: 1.0 | grad norm: 
0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:15.509033 | finish at 2025-09-10 12:30:07 + [2025-09-09 18:01:57] iteration 631/ 11920 | consumed samples: 646144 | elapsed time per iteration (ms): 5685.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.860123E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:40.378636 | finish at 2025-09-10 11:51:37 + [2025-09-09 18:02:03] iteration 632/ 11920 | consumed samples: 647168 | elapsed time per iteration (ms): 5677.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.866101E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:08.284851 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:02:09] iteration 633/ 11920 | consumed samples: 648192 | elapsed time per iteration (ms): 5922.8 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.849820E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:10.854224 | finish at 2025-09-10 12:36:20 + [2025-09-09 18:02:14] iteration 634/ 11920 | consumed samples: 649216 | elapsed time per iteration (ms): 5667.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.833508E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:59.616580 | finish at 2025-09-10 11:48:14 + [2025-09-09 18:02:20] iteration 635/ 11920 | consumed samples: 650240 | elapsed time per iteration (ms): 5895.4 | 
throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.839920E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:49.242452 | finish at 2025-09-10 12:31:09 + [2025-09-09 18:02:26] iteration 636/ 11920 | consumed samples: 651264 | elapsed time per iteration (ms): 5689.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.847050E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:00.383622 | finish at 2025-09-10 11:52:26 + [2025-09-09 18:02:32] iteration 637/ 11920 | consumed samples: 652288 | elapsed time per iteration (ms): 5690.6 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.841516E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:07.216424 | finish at 2025-09-10 11:52:39 + [2025-09-09 18:02:37] iteration 638/ 11920 | consumed samples: 653312 | elapsed time per iteration (ms): 5687.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.834232E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:30.178432 | finish at 2025-09-10 11:52:07 + [2025-09-09 18:02:43] iteration 639/ 11920 | consumed samples: 654336 | elapsed time per iteration (ms): 5708.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.833736E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
17:53:11.638072 | finish at 2025-09-10 11:55:55 + [2025-09-09 18:02:49] iteration 640/ 11920 | consumed samples: 655360 | elapsed time per iteration (ms): 5673.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.828066E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:38.651276 | finish at 2025-09-10 11:49:27 + [2025-09-09 18:02:54] iteration 641/ 11920 | consumed samples: 656384 | elapsed time per iteration (ms): 5696.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.833428E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:53.419214 | finish at 2025-09-10 11:53:48 + [2025-09-09 18:03:00] iteration 642/ 11920 | consumed samples: 657408 | elapsed time per iteration (ms): 5687.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.840726E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:06.235907 | finish at 2025-09-10 11:52:06 + [2025-09-09 18:03:06] iteration 643/ 11920 | consumed samples: 658432 | elapsed time per iteration (ms): 6088.8 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.835482E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:04:23.302973 | finish at 2025-09-10 13:07:29 + [2025-09-09 18:03:12] iteration 644/ 11920 | consumed samples: 659456 | elapsed time per iteration (ms): 5932.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 3.824655E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:34:53.589027 | finish at 2025-09-10 12:38:06 + [2025-09-09 18:03:18] iteration 645/ 11920 | consumed samples: 660480 | elapsed time per iteration (ms): 5680.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.822571E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:27.869027 | finish at 2025-09-10 11:50:46 + [2025-09-09 18:03:23] iteration 646/ 11920 | consumed samples: 661504 | elapsed time per iteration (ms): 5673.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.838523E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:59.787292 | finish at 2025-09-10 11:49:23 + [2025-09-09 18:03:29] iteration 647/ 11920 | consumed samples: 662528 | elapsed time per iteration (ms): 5676.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.829480E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:25.750910 | finish at 2025-09-10 11:49:55 + [2025-09-09 18:03:35] iteration 648/ 11920 | consumed samples: 663552 | elapsed time per iteration (ms): 5691.0 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.823262E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:09.449007 | finish at 2025-09-10 11:52:44 + [2025-09-09 18:03:40] iteration 649/ 11920 | consumed samples: 664576 
| elapsed time per iteration (ms): 5698.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.828176E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:21.972066 | finish at 2025-09-10 11:54:02 + [2025-09-09 18:03:47] iteration 650/ 11920 | consumed samples: 665600 | elapsed time per iteration (ms): 6067.4 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.859217E+00 | loss scale: 1.0 | grad norm: 0.393 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:59:40.003493 | finish at 2025-09-10 13:03:27 + [2025-09-09 18:03:52] iteration 651/ 11920 | consumed samples: 666624 | elapsed time per iteration (ms): 5676.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.824895E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:09.975938 | finish at 2025-09-10 11:50:02 + [2025-09-09 18:03:58] iteration 652/ 11920 | consumed samples: 667648 | elapsed time per iteration (ms): 6238.3 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.806307E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:31:32.692534 | finish at 2025-09-10 13:35:31 + [2025-09-09 18:04:05] iteration 653/ 11920 | consumed samples: 668672 | elapsed time per iteration (ms): 6161.4 | throughput per GPU (TFLOP/s/GPU): 73.3 | MFU 7.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.815627E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 19:17:00.215327 | finish at 2025-09-10 13:21:05 + [2025-09-09 18:04:10] iteration 654/ 11920 | consumed samples: 669696 | elapsed time per iteration (ms): 5688.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.805747E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:08.187333 | finish at 2025-09-10 11:52:19 + [2025-09-09 18:04:16] iteration 655/ 11920 | consumed samples: 670720 | elapsed time per iteration (ms): 5900.8 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.803294E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:53.065156 | finish at 2025-09-10 12:32:09 + [2025-09-09 18:04:23] iteration 656/ 11920 | consumed samples: 671744 | elapsed time per iteration (ms): 6462.4 | throughput per GPU (TFLOP/s/GPU): 69.9 | MFU 7.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.813660E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:13:12.768555 | finish at 2025-09-10 14:17:35 + [2025-09-09 18:04:28] iteration 657/ 11920 | consumed samples: 672768 | elapsed time per iteration (ms): 5693.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.818066E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:43.106308 | finish at 2025-09-10 11:53:11 + [2025-09-09 18:04:34] iteration 658/ 11920 | consumed samples: 673792 | elapsed time per iteration (ms): 5701.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.798105E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:12.145011 | finish at 2025-09-10 11:54:46 + [2025-09-09 18:04:40] iteration 659/ 11920 | consumed samples: 674816 | elapsed time per iteration (ms): 6379.0 | throughput per GPU (TFLOP/s/GPU): 70.8 | MFU 7.16% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.815793E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:57:13.547913 | finish at 2025-09-10 14:01:54 + [2025-09-09 18:04:46] iteration 660/ 11920 | consumed samples: 675840 | elapsed time per iteration (ms): 5691.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.802102E+00 | loss scale: 1.0 | grad norm: 0.328 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:09.011598 | finish at 2025-09-10 11:52:55 + [2025-09-09 18:04:52] iteration 661/ 11920 | consumed samples: 676864 | elapsed time per iteration (ms): 5696.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.811245E+00 | loss scale: 1.0 | grad norm: 0.321 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:54.693038 | finish at 2025-09-10 11:53:47 + [2025-09-09 18:04:58] iteration 662/ 11920 | consumed samples: 677888 | elapsed time per iteration (ms): 5693.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.818665E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:20.005592 | finish at 2025-09-10 11:53:18 + [2025-09-09 18:05:03] 
iteration 663/ 11920 | consumed samples: 678912 | elapsed time per iteration (ms): 5676.2 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.800788E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:56.609362 | finish at 2025-09-10 11:50:00 + [2025-09-09 18:05:09] iteration 664/ 11920 | consumed samples: 679936 | elapsed time per iteration (ms): 5701.8 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.797867E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:49:39.421761 | finish at 2025-09-10 11:54:48 + [2025-09-09 18:05:15] iteration 665/ 11920 | consumed samples: 680960 | elapsed time per iteration (ms): 5690.8 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.782599E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:29.561678 | finish at 2025-09-10 11:52:44 + [2025-09-09 18:05:20] iteration 666/ 11920 | consumed samples: 681984 | elapsed time per iteration (ms): 5694.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.800411E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:03.729295 | finish at 2025-09-10 11:53:24 + [2025-09-09 18:05:26] iteration 667/ 11920 | consumed samples: 683008 | elapsed time per iteration (ms): 5683.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.786963E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:54.754618 | finish at 2025-09-10 11:51:21 + [2025-09-09 18:05:32] iteration 668/ 11920 | consumed samples: 684032 | elapsed time per iteration (ms): 5698.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.777904E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:48:34.383734 | finish at 2025-09-10 11:54:06 + [2025-09-09 18:05:37] iteration 669/ 11920 | consumed samples: 685056 | elapsed time per iteration (ms): 5692.9 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.773562E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:30.865536 | finish at 2025-09-10 11:53:08 + [2025-09-09 18:05:43] iteration 670/ 11920 | consumed samples: 686080 | elapsed time per iteration (ms): 5692.0 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.766243E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:14.856856 | finish at 2025-09-10 11:52:58 + [2025-09-09 18:05:49] iteration 671/ 11920 | consumed samples: 687104 | elapsed time per iteration (ms): 5681.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.758832E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:06.081192 | finish at 2025-09-10 11:50:55 + [2025-09-09 18:05:54] iteration 672/ 11920 | consumed samples: 688128 | elapsed time per iteration (ms): 5682.9 | throughput per GPU 
(TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.760750E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:20.928806 | finish at 2025-09-10 11:51:15 + [2025-09-09 18:06:00] iteration 673/ 11920 | consumed samples: 689152 | elapsed time per iteration (ms): 5920.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.778365E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:51.800383 | finish at 2025-09-10 12:35:52 + [2025-09-09 18:06:06] iteration 674/ 11920 | consumed samples: 690176 | elapsed time per iteration (ms): 5685.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.772028E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:41.609429 | finish at 2025-09-10 11:51:48 + [2025-09-09 18:06:12] iteration 675/ 11920 | consumed samples: 691200 | elapsed time per iteration (ms): 5694.0 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.791202E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:08.863841 | finish at 2025-09-10 11:53:21 + [2025-09-09 18:06:17] iteration 676/ 11920 | consumed samples: 692224 | elapsed time per iteration (ms): 5683.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.788409E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:03.314930 | finish at 
2025-09-10 11:51:21 + [2025-09-09 18:06:23] iteration 677/ 11920 | consumed samples: 693248 | elapsed time per iteration (ms): 5685.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.769828E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:24.262770 | finish at 2025-09-10 11:51:47 + [2025-09-09 18:06:29] iteration 678/ 11920 | consumed samples: 694272 | elapsed time per iteration (ms): 5685.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.775553E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:13.020810 | finish at 2025-09-10 11:51:42 + [2025-09-09 18:06:34] iteration 679/ 11920 | consumed samples: 695296 | elapsed time per iteration (ms): 5670.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.782053E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:17.634005 | finish at 2025-09-10 11:48:52 + [2025-09-09 18:06:40] iteration 680/ 11920 | consumed samples: 696320 | elapsed time per iteration (ms): 5691.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.767683E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:11.189184 | finish at 2025-09-10 11:52:51 + [2025-09-09 18:06:46] iteration 681/ 11920 | consumed samples: 697344 | elapsed time per iteration (ms): 5690.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.784900E+00 | loss 
scale: 1.0 | grad norm: 0.248 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:54.766054 | finish at 2025-09-10 11:52:41 + [2025-09-09 18:06:52] iteration 682/ 11920 | consumed samples: 698368 | elapsed time per iteration (ms): 5681.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.764779E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:13.061186 | finish at 2025-09-10 11:51:05 + [2025-09-09 18:06:57] iteration 683/ 11920 | consumed samples: 699392 | elapsed time per iteration (ms): 5678.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.768033E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:30.597803 | finish at 2025-09-10 11:50:28 + [2025-09-09 18:07:03] iteration 684/ 11920 | consumed samples: 700416 | elapsed time per iteration (ms): 5673.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.746848E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:28.820947 | finish at 2025-09-10 11:49:32 + [2025-09-09 18:07:09] iteration 685/ 11920 | consumed samples: 701440 | elapsed time per iteration (ms): 5690.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.759623E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:29.135510 | finish at 2025-09-10 11:52:38 + [2025-09-09 18:07:15] iteration 686/ 11920 | consumed samples: 702464 | elapsed time per 
iteration (ms): 6007.0 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.755821E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:44:43.109055 | finish at 2025-09-10 12:51:58 + [2025-09-09 18:07:20] iteration 687/ 11920 | consumed samples: 703488 | elapsed time per iteration (ms): 5677.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.739861E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:56.088744 | finish at 2025-09-10 11:50:16 + [2025-09-09 18:07:26] iteration 688/ 11920 | consumed samples: 704512 | elapsed time per iteration (ms): 5677.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.740836E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:44.696503 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:07:32] iteration 689/ 11920 | consumed samples: 705536 | elapsed time per iteration (ms): 5687.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.743822E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:36.880168 | finish at 2025-09-10 11:52:09 + [2025-09-09 18:07:37] iteration 690/ 11920 | consumed samples: 706560 | elapsed time per iteration (ms): 5688.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.733371E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:44:37.827315 | finish at 2025-09-10 11:52:15 + [2025-09-09 18:07:43] iteration 691/ 11920 | consumed samples: 707584 | elapsed time per iteration (ms): 5685.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.746876E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:04.542573 | finish at 2025-09-10 11:51:48 + [2025-09-09 18:07:49] iteration 692/ 11920 | consumed samples: 708608 | elapsed time per iteration (ms): 5681.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.741899E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:06.174242 | finish at 2025-09-10 11:50:55 + [2025-09-09 18:07:54] iteration 693/ 11920 | consumed samples: 709632 | elapsed time per iteration (ms): 5681.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.732365E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:07.905103 | finish at 2025-09-10 11:51:02 + [2025-09-09 18:08:00] iteration 694/ 11920 | consumed samples: 710656 | elapsed time per iteration (ms): 5687.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.724090E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:09.868983 | finish at 2025-09-10 11:52:10 + [2025-09-09 18:08:06] iteration 695/ 11920 | consumed samples: 711680 | elapsed time per iteration (ms): 5683.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 3.734685E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:20.622684 | finish at 2025-09-10 11:51:26 + [2025-09-09 18:08:11] iteration 696/ 11920 | consumed samples: 712704 | elapsed time per iteration (ms): 5683.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.735854E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:05.513979 | finish at 2025-09-10 11:51:17 + [2025-09-09 18:08:17] iteration 697/ 11920 | consumed samples: 713728 | elapsed time per iteration (ms): 5672.7 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.738045E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:05.061821 | finish at 2025-09-10 11:49:22 + [2025-09-09 18:08:23] iteration 698/ 11920 | consumed samples: 714752 | elapsed time per iteration (ms): 5683.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.721766E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:57.653013 | finish at 2025-09-10 11:51:20 + [2025-09-09 18:08:28] iteration 699/ 11920 | consumed samples: 715776 | elapsed time per iteration (ms): 5682.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.712722E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:38.216052 | finish at 2025-09-10 11:51:07 + [2025-09-09 18:08:35] iteration 700/ 11920 | consumed 
samples: 716800 | elapsed time per iteration (ms): 6320.8 | throughput per GPU (TFLOP/s/GPU): 71.4 | MFU 7.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.715632E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:41:59.772649 | finish at 2025-09-10 13:50:35 + [2025-09-09 18:08:41] iteration 701/ 11920 | consumed samples: 717824 | elapsed time per iteration (ms): 5921.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.743013E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:27:12.928312 | finish at 2025-09-10 12:35:54 + [2025-09-09 18:08:46] iteration 702/ 11920 | consumed samples: 718848 | elapsed time per iteration (ms): 5691.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.715132E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:09.846114 | finish at 2025-09-10 11:52:56 + [2025-09-09 18:08:52] iteration 703/ 11920 | consumed samples: 719872 | elapsed time per iteration (ms): 5670.8 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.720903E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:09.689540 | finish at 2025-09-10 11:49:02 + [2025-09-09 18:08:58] iteration 704/ 11920 | consumed samples: 720896 | elapsed time per iteration (ms): 5676.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.717402E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:07.186367 | finish at 2025-09-10 11:50:05 + [2025-09-09 18:09:03] iteration 705/ 11920 | consumed samples: 721920 | elapsed time per iteration (ms): 5673.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.705970E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:25.492953 | finish at 2025-09-10 11:49:29 + [2025-09-09 18:09:09] iteration 706/ 11920 | consumed samples: 722944 | elapsed time per iteration (ms): 5675.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.704462E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:49.144033 | finish at 2025-09-10 11:49:58 + [2025-09-09 18:09:15] iteration 707/ 11920 | consumed samples: 723968 | elapsed time per iteration (ms): 5671.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.729062E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:58.624766 | finish at 2025-09-10 11:49:13 + [2025-09-09 18:09:20] iteration 708/ 11920 | consumed samples: 724992 | elapsed time per iteration (ms): 5672.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.725064E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:54.511347 | finish at 2025-09-10 11:49:15 + [2025-09-09 18:09:26] iteration 709/ 11920 | consumed samples: 726016 | elapsed time per iteration (ms): 5672.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 
8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.710351E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:53.396655 | finish at 2025-09-10 11:49:19 + [2025-09-09 18:09:32] iteration 710/ 11920 | consumed samples: 727040 | elapsed time per iteration (ms): 5684.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.713341E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:57.463775 | finish at 2025-09-10 11:51:29 + [2025-09-09 18:09:37] iteration 711/ 11920 | consumed samples: 728064 | elapsed time per iteration (ms): 5687.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.712956E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:25.404353 | finish at 2025-09-10 11:52:03 + [2025-09-09 18:09:43] iteration 712/ 11920 | consumed samples: 729088 | elapsed time per iteration (ms): 5689.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.716744E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:45.771275 | finish at 2025-09-10 11:52:29 + [2025-09-09 18:09:49] iteration 713/ 11920 | consumed samples: 730112 | elapsed time per iteration (ms): 5676.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.721871E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:10.831791 | finish at 2025-09-10 11:50:00 + 
[2025-09-09 18:09:55] iteration 714/ 11920 | consumed samples: 731136 | elapsed time per iteration (ms): 5687.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.709375E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:13.756303 | finish at 2025-09-10 11:52:08 + [2025-09-09 18:10:00] iteration 715/ 11920 | consumed samples: 732160 | elapsed time per iteration (ms): 5680.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.701940E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:43.930568 | finish at 2025-09-10 11:50:44 + [2025-09-09 18:10:06] iteration 716/ 11920 | consumed samples: 733184 | elapsed time per iteration (ms): 5690.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.719758E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:36.648058 | finish at 2025-09-10 11:52:43 + [2025-09-09 18:10:12] iteration 717/ 11920 | consumed samples: 734208 | elapsed time per iteration (ms): 5694.1 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.708154E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:10.640629 | finish at 2025-09-10 11:53:22 + [2025-09-09 18:10:17] iteration 718/ 11920 | consumed samples: 735232 | elapsed time per iteration (ms): 5688.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.698535E+00 | loss scale: 1.0 | grad norm: 
0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:58.845129 | finish at 2025-09-10 11:52:16 + [2025-09-09 18:10:23] iteration 719/ 11920 | consumed samples: 736256 | elapsed time per iteration (ms): 5679.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.705446E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:09.973160 | finish at 2025-09-10 11:50:33 + [2025-09-09 18:10:29] iteration 720/ 11920 | consumed samples: 737280 | elapsed time per iteration (ms): 5670.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.703740E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:34.425659 | finish at 2025-09-10 11:49:03 + [2025-09-09 18:10:34] iteration 721/ 11920 | consumed samples: 738304 | elapsed time per iteration (ms): 5679.2 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.702465E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:01.221218 | finish at 2025-09-10 11:50:36 + [2025-09-09 18:10:40] iteration 722/ 11920 | consumed samples: 739328 | elapsed time per iteration (ms): 5681.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.708111E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:23.551021 | finish at 2025-09-10 11:51:04 + [2025-09-09 18:10:46] iteration 723/ 11920 | consumed samples: 740352 | elapsed time per iteration (ms): 5689.6 | 
throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.699261E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:46.672672 | finish at 2025-09-10 11:52:32 + [2025-09-09 18:10:51] iteration 724/ 11920 | consumed samples: 741376 | elapsed time per iteration (ms): 5677.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.696466E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:27.502985 | finish at 2025-09-10 11:50:19 + [2025-09-09 18:10:57] iteration 725/ 11920 | consumed samples: 742400 | elapsed time per iteration (ms): 5681.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.704700E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:05.440985 | finish at 2025-09-10 11:51:02 + [2025-09-09 18:11:03] iteration 726/ 11920 | consumed samples: 743424 | elapsed time per iteration (ms): 5687.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.698600E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:01.821005 | finish at 2025-09-10 11:52:05 + [2025-09-09 18:11:08] iteration 727/ 11920 | consumed samples: 744448 | elapsed time per iteration (ms): 5677.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.679756E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
17:39:03.296644 | finish at 2025-09-10 11:50:12 + [2025-09-09 18:11:14] iteration 728/ 11920 | consumed samples: 745472 | elapsed time per iteration (ms): 5682.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.674818E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:52.481495 | finish at 2025-09-10 11:51:07 + [2025-09-09 18:11:20] iteration 729/ 11920 | consumed samples: 746496 | elapsed time per iteration (ms): 5677.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.687276E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:57.988541 | finish at 2025-09-10 11:50:18 + [2025-09-09 18:11:25] iteration 730/ 11920 | consumed samples: 747520 | elapsed time per iteration (ms): 5674.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.673571E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:18.492594 | finish at 2025-09-10 11:49:44 + [2025-09-09 18:11:31] iteration 731/ 11920 | consumed samples: 748544 | elapsed time per iteration (ms): 5673.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.675753E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:55.579565 | finish at 2025-09-10 11:49:27 + [2025-09-09 18:11:37] iteration 732/ 11920 | consumed samples: 749568 | elapsed time per iteration (ms): 5674.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 3.660924E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:03.849172 | finish at 2025-09-10 11:49:41 + [2025-09-09 18:11:42] iteration 733/ 11920 | consumed samples: 750592 | elapsed time per iteration (ms): 5679.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.659419E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:00.912503 | finish at 2025-09-10 11:50:43 + [2025-09-09 18:11:48] iteration 734/ 11920 | consumed samples: 751616 | elapsed time per iteration (ms): 5677.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.649098E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:32.035481 | finish at 2025-09-10 11:50:20 + [2025-09-09 18:11:54] iteration 735/ 11920 | consumed samples: 752640 | elapsed time per iteration (ms): 5911.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.653981E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:21:54.753820 | finish at 2025-09-10 12:33:49 + [2025-09-09 18:12:00] iteration 736/ 11920 | consumed samples: 753664 | elapsed time per iteration (ms): 5678.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.653461E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:32.737644 | finish at 2025-09-10 11:50:32 + [2025-09-09 18:12:05] iteration 737/ 11920 | consumed samples: 754688 
| elapsed time per iteration (ms): 5668.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.663088E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:26.688907 | finish at 2025-09-10 11:48:32 + [2025-09-09 18:12:11] iteration 738/ 11920 | consumed samples: 755712 | elapsed time per iteration (ms): 5676.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.663555E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:58.404300 | finish at 2025-09-10 11:50:09 + [2025-09-09 18:12:17] iteration 739/ 11920 | consumed samples: 756736 | elapsed time per iteration (ms): 5684.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.678153E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:17.693171 | finish at 2025-09-10 11:51:34 + [2025-09-09 18:12:22] iteration 740/ 11920 | consumed samples: 757760 | elapsed time per iteration (ms): 5673.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.652657E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:08.773761 | finish at 2025-09-10 11:49:31 + [2025-09-09 18:12:28] iteration 741/ 11920 | consumed samples: 758784 | elapsed time per iteration (ms): 5672.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.662793E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 17:36:57.178091 | finish at 2025-09-10 11:49:25 + [2025-09-09 18:12:34] iteration 742/ 11920 | consumed samples: 759808 | elapsed time per iteration (ms): 5680.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.677506E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:17.418194 | finish at 2025-09-10 11:50:51 + [2025-09-09 18:12:39] iteration 743/ 11920 | consumed samples: 760832 | elapsed time per iteration (ms): 5683.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.671082E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:46.811779 | finish at 2025-09-10 11:51:26 + [2025-09-09 18:12:45] iteration 744/ 11920 | consumed samples: 761856 | elapsed time per iteration (ms): 5679.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.653876E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:51.548491 | finish at 2025-09-10 11:50:37 + [2025-09-09 18:12:51] iteration 745/ 11920 | consumed samples: 762880 | elapsed time per iteration (ms): 5674.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.670722E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:49.358829 | finish at 2025-09-10 11:49:40 + [2025-09-09 18:12:56] iteration 746/ 11920 | consumed samples: 763904 | elapsed time per iteration (ms): 5669.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.678074E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:47.152641 | finish at 2025-09-10 11:48:44 + [2025-09-09 18:13:02] iteration 747/ 11920 | consumed samples: 764928 | elapsed time per iteration (ms): 5668.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.676841E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:28.294759 | finish at 2025-09-10 11:48:30 + [2025-09-09 18:13:08] iteration 748/ 11920 | consumed samples: 765952 | elapsed time per iteration (ms): 5667.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.671977E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:18.508839 | finish at 2025-09-10 11:48:26 + [2025-09-09 18:13:14] iteration 749/ 11920 | consumed samples: 766976 | elapsed time per iteration (ms): 5952.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.683939E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:28:16.478645 | finish at 2025-09-10 12:41:30 + [2025-09-09 18:13:20] iteration 750/ 11920 | consumed samples: 768000 | elapsed time per iteration (ms): 5930.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.670284E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:00.713282 | finish at 2025-09-10 12:37:20 + [2025-09-09 18:13:25] 
iteration 751/ 11920 | consumed samples: 769024 | elapsed time per iteration (ms): 5676.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.678309E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:45.697192 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:13:31] iteration 752/ 11920 | consumed samples: 770048 | elapsed time per iteration (ms): 5672.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.670133E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:46.287804 | finish at 2025-09-10 11:49:17 + [2025-09-09 18:13:37] iteration 753/ 11920 | consumed samples: 771072 | elapsed time per iteration (ms): 5669.7 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.673066E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:13.033005 | finish at 2025-09-10 11:48:50 + [2025-09-09 18:13:42] iteration 754/ 11920 | consumed samples: 772096 | elapsed time per iteration (ms): 5695.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.666358E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:54.250716 | finish at 2025-09-10 11:53:37 + [2025-09-09 18:13:48] iteration 755/ 11920 | consumed samples: 773120 | elapsed time per iteration (ms): 6007.9 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.660825E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:37:58.033131 | finish at 2025-09-10 12:51:46 + [2025-09-09 18:13:55] iteration 756/ 11920 | consumed samples: 774144 | elapsed time per iteration (ms): 6082.4 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.654343E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:51:43.624407 | finish at 2025-09-10 13:05:38 + [2025-09-09 18:14:00] iteration 757/ 11920 | consumed samples: 775168 | elapsed time per iteration (ms): 5668.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.640653E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:31.370145 | finish at 2025-09-10 11:48:32 + [2025-09-09 18:14:06] iteration 758/ 11920 | consumed samples: 776192 | elapsed time per iteration (ms): 5919.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.642513E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:21:10.743142 | finish at 2025-09-10 12:35:17 + [2025-09-09 18:14:12] iteration 759/ 11920 | consumed samples: 777216 | elapsed time per iteration (ms): 5671.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.641876E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:53.897993 | finish at 2025-09-10 11:49:06 + [2025-09-09 18:14:17] iteration 760/ 11920 | consumed samples: 778240 | elapsed time per iteration (ms): 5663.1 | throughput per GPU 
(TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.647535E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:20.076313 | finish at 2025-09-10 11:47:38 + [2025-09-09 18:14:23] iteration 761/ 11920 | consumed samples: 779264 | elapsed time per iteration (ms): 5658.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.645166E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:20.561782 | finish at 2025-09-10 11:46:44 + [2025-09-09 18:14:29] iteration 762/ 11920 | consumed samples: 780288 | elapsed time per iteration (ms): 6055.6 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.626358E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:46:08.841563 | finish at 2025-09-10 13:00:38 + [2025-09-09 18:14:35] iteration 763/ 11920 | consumed samples: 781312 | elapsed time per iteration (ms): 5671.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.653283E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:35.126954 | finish at 2025-09-10 11:49:10 + [2025-09-09 18:14:41] iteration 764/ 11920 | consumed samples: 782336 | elapsed time per iteration (ms): 5677.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.617885E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:42.645267 | finish at 
2025-09-10 11:50:23 + [2025-09-09 18:14:46] iteration 765/ 11920 | consumed samples: 783360 | elapsed time per iteration (ms): 5681.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.598282E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:12.195890 | finish at 2025-09-10 11:50:58 + [2025-09-09 18:14:52] iteration 766/ 11920 | consumed samples: 784384 | elapsed time per iteration (ms): 5670.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.624884E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:08.805315 | finish at 2025-09-10 11:49:01 + [2025-09-09 18:14:58] iteration 767/ 11920 | consumed samples: 785408 | elapsed time per iteration (ms): 5680.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.624782E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:58.190647 | finish at 2025-09-10 11:50:56 + [2025-09-09 18:15:03] iteration 768/ 11920 | consumed samples: 786432 | elapsed time per iteration (ms): 5672.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.613307E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:20.460648 | finish at 2025-09-10 11:49:24 + [2025-09-09 18:15:09] iteration 769/ 11920 | consumed samples: 787456 | elapsed time per iteration (ms): 5681.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.633541E+00 | loss 
scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:55.309958 | finish at 2025-09-10 11:51:04 + [2025-09-09 18:15:15] iteration 770/ 11920 | consumed samples: 788480 | elapsed time per iteration (ms): 5946.4 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.616505E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:02.890861 | finish at 2025-09-10 12:40:18 + [2025-09-09 18:15:21] iteration 771/ 11920 | consumed samples: 789504 | elapsed time per iteration (ms): 5956.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.611162E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:26:48.532658 | finish at 2025-09-10 12:42:09 + [2025-09-09 18:15:26] iteration 772/ 11920 | consumed samples: 790528 | elapsed time per iteration (ms): 5675.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.608906E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:30.621906 | finish at 2025-09-10 11:49:57 + [2025-09-09 18:15:32] iteration 773/ 11920 | consumed samples: 791552 | elapsed time per iteration (ms): 5664.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.614398E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:25.086292 | finish at 2025-09-10 11:47:57 + [2025-09-09 18:15:38] iteration 774/ 11920 | consumed samples: 792576 | elapsed time per 
iteration (ms): 5883.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.615756E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:01.167308 | finish at 2025-09-10 12:28:39 + [2025-09-09 18:15:44] iteration 775/ 11920 | consumed samples: 793600 | elapsed time per iteration (ms): 5682.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.619964E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:29.401914 | finish at 2025-09-10 11:51:13 + [2025-09-09 18:15:49] iteration 776/ 11920 | consumed samples: 794624 | elapsed time per iteration (ms): 5673.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.617750E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:41.770283 | finish at 2025-09-10 11:49:31 + [2025-09-09 18:15:55] iteration 777/ 11920 | consumed samples: 795648 | elapsed time per iteration (ms): 5672.7 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.627818E+00 | loss scale: 1.0 | grad norm: 0.361 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:31.153001 | finish at 2025-09-10 11:49:26 + [2025-09-09 18:16:01] iteration 778/ 11920 | consumed samples: 796672 | elapsed time per iteration (ms): 5670.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.634311E+00 | loss scale: 1.0 | grad norm: 0.390 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:32:56.232655 | finish at 2025-09-10 11:48:57 + [2025-09-09 18:16:07] iteration 779/ 11920 | consumed samples: 797696 | elapsed time per iteration (ms): 5888.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.629088E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:21.589497 | finish at 2025-09-10 12:29:28 + [2025-09-09 18:16:12] iteration 780/ 11920 | consumed samples: 798720 | elapsed time per iteration (ms): 5882.9 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.637842E+00 | loss scale: 1.0 | grad norm: 0.319 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:15.102286 | finish at 2025-09-10 12:28:28 + [2025-09-09 18:16:18] iteration 781/ 11920 | consumed samples: 799744 | elapsed time per iteration (ms): 5672.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.643242E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:02.005993 | finish at 2025-09-10 11:49:20 + [2025-09-09 18:16:24] iteration 782/ 11920 | consumed samples: 800768 | elapsed time per iteration (ms): 5680.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.629008E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:25.242851 | finish at 2025-09-10 11:50:49 + [2025-09-09 18:16:30] iteration 783/ 11920 | consumed samples: 801792 | elapsed time per iteration (ms): 5676.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 3.625039E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:37.171376 | finish at 2025-09-10 11:50:07 + [2025-09-09 18:16:35] iteration 784/ 11920 | consumed samples: 802816 | elapsed time per iteration (ms): 5676.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.619496E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:36.218353 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:16:41] iteration 785/ 11920 | consumed samples: 803840 | elapsed time per iteration (ms): 5906.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.618119E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:16:02.862870 | finish at 2025-09-10 12:32:44 + [2025-09-09 18:16:47] iteration 786/ 11920 | consumed samples: 804864 | elapsed time per iteration (ms): 5675.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.610328E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:08.000494 | finish at 2025-09-10 11:49:55 + [2025-09-09 18:16:52] iteration 787/ 11920 | consumed samples: 805888 | elapsed time per iteration (ms): 5671.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.618082E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:24.798573 | finish at 2025-09-10 11:49:17 + [2025-09-09 18:16:58] iteration 788/ 11920 | consumed 
samples: 806912 | elapsed time per iteration (ms): 5663.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.608973E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:41.642517 | finish at 2025-09-10 11:47:40 + [2025-09-09 18:17:04] iteration 789/ 11920 | consumed samples: 807936 | elapsed time per iteration (ms): 5674.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.606978E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:37.095238 | finish at 2025-09-10 11:49:41 + [2025-09-09 18:17:09] iteration 790/ 11920 | consumed samples: 808960 | elapsed time per iteration (ms): 5669.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.600002E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:38.009620 | finish at 2025-09-10 11:48:47 + [2025-09-09 18:17:15] iteration 791/ 11920 | consumed samples: 809984 | elapsed time per iteration (ms): 5669.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.611485E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:32.372277 | finish at 2025-09-10 11:48:48 + [2025-09-09 18:17:21] iteration 792/ 11920 | consumed samples: 811008 | elapsed time per iteration (ms): 5666.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.609845E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:54.252756 | finish at 2025-09-10 11:48:15 + [2025-09-09 18:17:27] iteration 793/ 11920 | consumed samples: 812032 | elapsed time per iteration (ms): 6030.4 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.587798E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:38:20.359377 | finish at 2025-09-10 12:55:47 + [2025-09-09 18:17:32] iteration 794/ 11920 | consumed samples: 813056 | elapsed time per iteration (ms): 5662.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.584466E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:01.806870 | finish at 2025-09-10 11:47:34 + [2025-09-09 18:17:38] iteration 795/ 11920 | consumed samples: 814080 | elapsed time per iteration (ms): 5881.9 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.601931E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:36.567992 | finish at 2025-09-10 12:28:15 + [2025-09-09 18:17:44] iteration 796/ 11920 | consumed samples: 815104 | elapsed time per iteration (ms): 5682.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.587124E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:34.478548 | finish at 2025-09-10 11:51:19 + [2025-09-09 18:17:50] iteration 797/ 11920 | consumed samples: 816128 | elapsed time per iteration (ms): 5680.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 
8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.613873E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:08.436971 | finish at 2025-09-10 11:50:58 + [2025-09-09 18:17:55] iteration 798/ 11920 | consumed samples: 817152 | elapsed time per iteration (ms): 5667.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.605231E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:37.263085 | finish at 2025-09-10 11:48:33 + [2025-09-09 18:18:01] iteration 799/ 11920 | consumed samples: 818176 | elapsed time per iteration (ms): 5664.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.592816E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:56.521865 | finish at 2025-09-10 11:47:58 + [2025-09-09 18:18:07] iteration 800/ 11920 | consumed samples: 819200 | elapsed time per iteration (ms): 5680.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.595302E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:46.611538 | finish at 2025-09-10 11:50:53 + [2025-09-09 18:18:12] iteration 801/ 11920 | consumed samples: 820224 | elapsed time per iteration (ms): 5666.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.592310E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:04.131148 | finish at 2025-09-10 11:48:17 + 
[2025-09-09 18:18:18] iteration 802/ 11920 | consumed samples: 821248 | elapsed time per iteration (ms): 5674.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.597794E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:24.051820 | finish at 2025-09-10 11:49:42 + [2025-09-09 18:18:24] iteration 803/ 11920 | consumed samples: 822272 | elapsed time per iteration (ms): 5666.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.597037E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:52.093418 | finish at 2025-09-10 11:48:16 + [2025-09-09 18:18:29] iteration 804/ 11920 | consumed samples: 823296 | elapsed time per iteration (ms): 5676.2 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.577646E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:37.059625 | finish at 2025-09-10 11:50:06 + [2025-09-09 18:18:35] iteration 805/ 11920 | consumed samples: 824320 | elapsed time per iteration (ms): 5660.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.570831E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:37.793663 | finish at 2025-09-10 11:47:13 + [2025-09-09 18:18:41] iteration 806/ 11920 | consumed samples: 825344 | elapsed time per iteration (ms): 5686.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.581865E+00 | loss scale: 1.0 | grad norm: 
0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:18.540255 | finish at 2025-09-10 11:51:59 + [2025-09-09 18:18:46] iteration 807/ 11920 | consumed samples: 826368 | elapsed time per iteration (ms): 5677.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.590264E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:34.092050 | finish at 2025-09-10 11:50:21 + [2025-09-09 18:18:52] iteration 808/ 11920 | consumed samples: 827392 | elapsed time per iteration (ms): 5669.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.586824E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:58.820274 | finish at 2025-09-10 11:48:51 + [2025-09-09 18:18:58] iteration 809/ 11920 | consumed samples: 828416 | elapsed time per iteration (ms): 5668.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.581778E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:39.267064 | finish at 2025-09-10 11:48:37 + [2025-09-09 18:19:03] iteration 810/ 11920 | consumed samples: 829440 | elapsed time per iteration (ms): 5673.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.580464E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:31.626801 | finish at 2025-09-10 11:49:35 + [2025-09-09 18:19:09] iteration 811/ 11920 | consumed samples: 830464 | elapsed time per iteration (ms): 5880.6 | 
throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.591709E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:47.815556 | finish at 2025-09-10 12:27:57 + [2025-09-09 18:19:15] iteration 812/ 11920 | consumed samples: 831488 | elapsed time per iteration (ms): 6146.5 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.573328E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:57:54.910086 | finish at 2025-09-10 13:17:10 + [2025-09-09 18:19:21] iteration 813/ 11920 | consumed samples: 832512 | elapsed time per iteration (ms): 5992.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.592587E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:29:15.141823 | finish at 2025-09-10 12:48:37 + [2025-09-09 18:19:28] iteration 814/ 11920 | consumed samples: 833536 | elapsed time per iteration (ms): 6453.5 | throughput per GPU (TFLOP/s/GPU): 70.0 | MFU 7.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.580073E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:54:32.065616 | finish at 2025-09-10 14:14:00 + [2025-09-09 18:19:34] iteration 815/ 11920 | consumed samples: 834560 | elapsed time per iteration (ms): 6118.9 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.46% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.567568E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
18:52:30.898815 | finish at 2025-09-10 13:12:05 + [2025-09-09 18:19:40] iteration 816/ 11920 | consumed samples: 835584 | elapsed time per iteration (ms): 5673.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.570074E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:53.199577 | finish at 2025-09-10 11:49:33 + [2025-09-09 18:19:45] iteration 817/ 11920 | consumed samples: 836608 | elapsed time per iteration (ms): 5680.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.573328E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:10.988913 | finish at 2025-09-10 11:50:56 + [2025-09-09 18:19:52] iteration 818/ 11920 | consumed samples: 837632 | elapsed time per iteration (ms): 6564.0 | throughput per GPU (TFLOP/s/GPU): 68.8 | MFU 6.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.568702E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 20:14:33.195928 | finish at 2025-09-10 14:34:25 + [2025-09-09 18:19:58] iteration 819/ 11920 | consumed samples: 838656 | elapsed time per iteration (ms): 5670.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.558944E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:02.845528 | finish at 2025-09-10 11:49:00 + [2025-09-09 18:20:03] iteration 820/ 11920 | consumed samples: 839680 | elapsed time per iteration (ms): 5676.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 3.560404E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:06.861734 | finish at 2025-09-10 11:50:10 + [2025-09-09 18:20:09] iteration 821/ 11920 | consumed samples: 840704 | elapsed time per iteration (ms): 6039.0 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.572406E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:37:06.803164 | finish at 2025-09-10 12:57:16 + [2025-09-09 18:20:15] iteration 822/ 11920 | consumed samples: 841728 | elapsed time per iteration (ms): 5672.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.561298E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:07.397485 | finish at 2025-09-10 11:49:22 + [2025-09-09 18:20:21] iteration 823/ 11920 | consumed samples: 842752 | elapsed time per iteration (ms): 5668.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.559804E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:21.674451 | finish at 2025-09-10 11:48:42 + [2025-09-09 18:20:26] iteration 824/ 11920 | consumed samples: 843776 | elapsed time per iteration (ms): 5666.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.556532E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:59.207224 | finish at 2025-09-10 11:48:26 + [2025-09-09 18:20:32] iteration 825/ 11920 | consumed samples: 844800 
| elapsed time per iteration (ms): 5665.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.569298E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:42.509679 | finish at 2025-09-10 11:48:15 + [2025-09-09 18:20:38] iteration 826/ 11920 | consumed samples: 845824 | elapsed time per iteration (ms): 5666.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.560057E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:41.078507 | finish at 2025-09-10 11:48:19 + [2025-09-09 18:20:43] iteration 827/ 11920 | consumed samples: 846848 | elapsed time per iteration (ms): 5676.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.547065E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:25.273689 | finish at 2025-09-10 11:50:09 + [2025-09-09 18:20:49] iteration 828/ 11920 | consumed samples: 847872 | elapsed time per iteration (ms): 5668.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.565878E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:49.699107 | finish at 2025-09-10 11:48:39 + [2025-09-09 18:20:55] iteration 829/ 11920 | consumed samples: 848896 | elapsed time per iteration (ms): 5670.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.569817E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 4.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 17:28:12.610685 | finish at 2025-09-10 11:49:07 + [2025-09-09 18:21:00] iteration 830/ 11920 | consumed samples: 849920 | elapsed time per iteration (ms): 5671.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.572022E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:13.116615 | finish at 2025-09-10 11:49:14 + [2025-09-09 18:21:06] iteration 831/ 11920 | consumed samples: 850944 | elapsed time per iteration (ms): 5666.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.554628E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:11.306508 | finish at 2025-09-10 11:48:17 + [2025-09-09 18:21:12] iteration 832/ 11920 | consumed samples: 851968 | elapsed time per iteration (ms): 5670.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.560831E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:55.141548 | finish at 2025-09-10 11:49:07 + [2025-09-09 18:21:17] iteration 833/ 11920 | consumed samples: 852992 | elapsed time per iteration (ms): 5664.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.565267E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:26:43.434901 | finish at 2025-09-10 11:48:01 + [2025-09-09 18:21:23] iteration 834/ 11920 | consumed samples: 854016 | elapsed time per iteration (ms): 5672.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.593754E+00 | loss scale: 1.0 | grad norm: 0.360 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:09.512591 | finish at 2025-09-10 11:49:33 + [2025-09-09 18:21:29] iteration 835/ 11920 | consumed samples: 855040 | elapsed time per iteration (ms): 5661.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.594391E+00 | loss scale: 1.0 | grad norm: 0.421 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:25:53.942657 | finish at 2025-09-10 11:47:23 + [2025-09-09 18:21:34] iteration 836/ 11920 | consumed samples: 856064 | elapsed time per iteration (ms): 5706.7 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.990715E+00 | loss scale: 1.0 | grad norm: 26.744 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:13.465440 | finish at 2025-09-10 11:55:48 + [2025-09-09 18:21:40] iteration 837/ 11920 | consumed samples: 857088 | elapsed time per iteration (ms): 5731.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.318710E+00 | loss scale: 1.0 | grad norm: 2.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:46.515322 | finish at 2025-09-10 12:00:27 + [2025-09-09 18:21:46] iteration 838/ 11920 | consumed samples: 858112 | elapsed time per iteration (ms): 5719.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.093310E+00 | loss scale: 1.0 | grad norm: 1.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:28.044225 | finish at 2025-09-10 11:58:14 + [2025-09-09 18:21:52] 
iteration 839/ 11920 | consumed samples: 859136 | elapsed time per iteration (ms): 5719.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.950365E+00 | loss scale: 1.0 | grad norm: 0.785 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:16.099960 | finish at 2025-09-10 11:58:08 + [2025-09-09 18:21:57] iteration 840/ 11920 | consumed samples: 860160 | elapsed time per iteration (ms): 5736.7 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.962802E+00 | loss scale: 1.0 | grad norm: 1.067 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:22.462292 | finish at 2025-09-10 12:01:20 + [2025-09-09 18:22:03] iteration 841/ 11920 | consumed samples: 861184 | elapsed time per iteration (ms): 5743.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.956920E+00 | loss scale: 1.0 | grad norm: 1.118 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:35.218622 | finish at 2025-09-10 12:02:38 + [2025-09-09 18:22:09] iteration 842/ 11920 | consumed samples: 862208 | elapsed time per iteration (ms): 5752.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.924542E+00 | loss scale: 1.0 | grad norm: 0.507 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:42:04.560730 | finish at 2025-09-10 12:04:13 + [2025-09-09 18:22:15] iteration 843/ 11920 | consumed samples: 863232 | elapsed time per iteration (ms): 5747.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.055473E+00 | loss scale: 1.0 | grad norm: 0.885 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:05.247015 | finish at 2025-09-10 12:03:20 + [2025-09-09 18:22:20] iteration 844/ 11920 | consumed samples: 864256 | elapsed time per iteration (ms): 5721.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.014394E+00 | loss scale: 1.0 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:11.618311 | finish at 2025-09-10 11:58:32 + [2025-09-09 18:22:26] iteration 845/ 11920 | consumed samples: 865280 | elapsed time per iteration (ms): 5750.1 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.061953E+00 | loss scale: 1.0 | grad norm: 0.850 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:41:21.852031 | finish at 2025-09-10 12:03:48 + [2025-09-09 18:22:32] iteration 846/ 11920 | consumed samples: 866304 | elapsed time per iteration (ms): 5761.5 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.001711E+00 | loss scale: 1.0 | grad norm: 0.493 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:23.322295 | finish at 2025-09-10 12:05:55 + [2025-09-09 18:22:38] iteration 847/ 11920 | consumed samples: 867328 | elapsed time per iteration (ms): 5766.3 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.103412E+00 | loss scale: 1.0 | grad norm: 0.842 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:10.020370 | finish at 2025-09-10 12:06:48 + [2025-09-09 18:22:43] iteration 848/ 11920 | consumed samples: 868352 | elapsed time per iteration (ms): 5813.6 | throughput per GPU 
(TFLOP/s/GPU): 77.7 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.062519E+00 | loss scale: 1.0 | grad norm: 0.835 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:47.905365 | finish at 2025-09-10 12:15:31 + [2025-09-09 18:22:49] iteration 849/ 11920 | consumed samples: 869376 | elapsed time per iteration (ms): 5768.6 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.090564E+00 | loss scale: 1.0 | grad norm: 0.491 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:44:24.210049 | finish at 2025-09-10 12:07:13 + [2025-09-09 18:22:55] iteration 850/ 11920 | consumed samples: 870400 | elapsed time per iteration (ms): 5742.2 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.127887E+00 | loss scale: 1.0 | grad norm: 1.241 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:26.635859 | finish at 2025-09-10 12:02:22 + [2025-09-09 18:23:01] iteration 851/ 11920 | consumed samples: 871424 | elapsed time per iteration (ms): 5747.3 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.128098E+00 | loss scale: 1.0 | grad norm: 0.546 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:40:17.028960 | finish at 2025-09-10 12:03:18 + [2025-09-09 18:23:06] iteration 852/ 11920 | consumed samples: 872448 | elapsed time per iteration (ms): 5718.7 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.111406E+00 | loss scale: 1.0 | grad norm: 0.801 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:54.808342 | finish at 
2025-09-10 11:58:01 + [2025-09-09 18:23:12] iteration 853/ 11920 | consumed samples: 873472 | elapsed time per iteration (ms): 5730.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.165786E+00 | loss scale: 1.0 | grad norm: 0.876 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:03.506721 | finish at 2025-09-10 12:00:16 + [2025-09-09 18:23:18] iteration 854/ 11920 | consumed samples: 874496 | elapsed time per iteration (ms): 5714.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.066889E+00 | loss scale: 1.0 | grad norm: 0.493 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:52.878348 | finish at 2025-09-10 11:57:11 + [2025-09-09 18:23:24] iteration 855/ 11920 | consumed samples: 875520 | elapsed time per iteration (ms): 5742.4 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.099239E+00 | loss scale: 1.0 | grad norm: 0.615 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:59.607750 | finish at 2025-09-10 12:02:23 + [2025-09-09 18:23:29] iteration 856/ 11920 | consumed samples: 876544 | elapsed time per iteration (ms): 5720.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.060969E+00 | loss scale: 1.0 | grad norm: 0.464 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:51.514315 | finish at 2025-09-10 11:58:21 + [2025-09-09 18:23:35] iteration 857/ 11920 | consumed samples: 877568 | elapsed time per iteration (ms): 5737.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.034095E+00 | loss 
scale: 1.0 | grad norm: 0.456 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:48.296355 | finish at 2025-09-10 12:01:23 + [2025-09-09 18:23:41] iteration 858/ 11920 | consumed samples: 878592 | elapsed time per iteration (ms): 5735.9 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.030438E+00 | loss scale: 1.0 | grad norm: 0.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:30.519698 | finish at 2025-09-10 12:01:11 + [2025-09-09 18:23:47] iteration 859/ 11920 | consumed samples: 879616 | elapsed time per iteration (ms): 5748.8 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.002319E+00 | loss scale: 1.0 | grad norm: 0.408 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:39:47.994116 | finish at 2025-09-10 12:03:35 + [2025-09-09 18:23:52] iteration 860/ 11920 | consumed samples: 880640 | elapsed time per iteration (ms): 5711.3 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.960186E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:47.063870 | finish at 2025-09-10 11:56:39 + [2025-09-09 18:23:58] iteration 861/ 11920 | consumed samples: 881664 | elapsed time per iteration (ms): 5989.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.942618E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:55.568646 | finish at 2025-09-10 12:47:54 + [2025-09-09 18:24:04] iteration 862/ 11920 | consumed samples: 882688 | elapsed time per 
iteration (ms): 5727.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.925403E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:33.257722 | finish at 2025-09-10 11:59:37 + [2025-09-09 18:24:10] iteration 863/ 11920 | consumed samples: 883712 | elapsed time per iteration (ms): 5943.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.899055E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:15:21.128808 | finish at 2025-09-10 12:39:31 + [2025-09-09 18:24:16] iteration 864/ 11920 | consumed samples: 884736 | elapsed time per iteration (ms): 5711.7 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.861667E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:32:28.583782 | finish at 2025-09-10 11:56:44 + [2025-09-09 18:24:21] iteration 865/ 11920 | consumed samples: 885760 | elapsed time per iteration (ms): 5722.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.839000E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:17.480979 | finish at 2025-09-10 11:58:39 + [2025-09-09 18:24:27] iteration 866/ 11920 | consumed samples: 886784 | elapsed time per iteration (ms): 5698.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.822677E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:29:45.970855 | finish at 2025-09-10 11:54:13 + [2025-09-09 18:24:33] iteration 867/ 11920 | consumed samples: 887808 | elapsed time per iteration (ms): 5931.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.804290E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:43.809965 | finish at 2025-09-10 12:37:17 + [2025-09-09 18:24:39] iteration 868/ 11920 | consumed samples: 888832 | elapsed time per iteration (ms): 5697.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.788096E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:28.501124 | finish at 2025-09-10 11:54:07 + [2025-09-09 18:24:44] iteration 869/ 11920 | consumed samples: 889856 | elapsed time per iteration (ms): 5705.3 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.765959E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:30:49.176473 | finish at 2025-09-10 11:55:34 + [2025-09-09 18:24:50] iteration 870/ 11920 | consumed samples: 890880 | elapsed time per iteration (ms): 5700.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.757309E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:51.895080 | finish at 2025-09-10 11:54:42 + [2025-09-09 18:24:56] iteration 871/ 11920 | consumed samples: 891904 | elapsed time per iteration (ms): 5701.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 3.731347E+00 | loss scale: 1.0 | grad norm: 0.092 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:57.635164 | finish at 2025-09-10 11:54:53 + [2025-09-09 18:25:01] iteration 872/ 11920 | consumed samples: 892928 | elapsed time per iteration (ms): 5695.1 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.735670E+00 | loss scale: 1.0 | grad norm: 0.093 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:39.465563 | finish at 2025-09-10 11:53:41 + [2025-09-09 18:25:07] iteration 873/ 11920 | consumed samples: 893952 | elapsed time per iteration (ms): 5692.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.719192E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:03.676549 | finish at 2025-09-10 11:53:11 + [2025-09-09 18:25:13] iteration 874/ 11920 | consumed samples: 894976 | elapsed time per iteration (ms): 5693.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.699650E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:09.024106 | finish at 2025-09-10 11:53:22 + [2025-09-09 18:25:19] iteration 875/ 11920 | consumed samples: 896000 | elapsed time per iteration (ms): 5690.8 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.707590E+00 | loss scale: 1.0 | grad norm: 0.081 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:35.269932 | finish at 2025-09-10 11:52:54 + [2025-09-09 18:25:24] iteration 876/ 11920 | consumed 
samples: 897024 | elapsed time per iteration (ms): 5691.6 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.699879E+00 | loss scale: 1.0 | grad norm: 0.088 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:38.055029 | finish at 2025-09-10 11:53:02 + [2025-09-09 18:25:30] iteration 877/ 11920 | consumed samples: 898048 | elapsed time per iteration (ms): 6009.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.671568E+00 | loss scale: 1.0 | grad norm: 0.095 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:25:57.622019 | finish at 2025-09-10 12:51:28 + [2025-09-09 18:25:36] iteration 878/ 11920 | consumed samples: 899072 | elapsed time per iteration (ms): 5686.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.672442E+00 | loss scale: 1.0 | grad norm: 0.097 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:26:34.116872 | finish at 2025-09-10 11:52:10 + [2025-09-09 18:25:42] iteration 879/ 11920 | consumed samples: 900096 | elapsed time per iteration (ms): 5702.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.671363E+00 | loss scale: 1.0 | grad norm: 0.092 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:17.294544 | finish at 2025-09-10 11:54:59 + [2025-09-09 18:25:47] iteration 880/ 11920 | consumed samples: 901120 | elapsed time per iteration (ms): 5700.9 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.662816E+00 | loss scale: 1.0 | grad norm: 0.105 | num zeros: 9.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:57.857895 | finish at 2025-09-10 11:54:45 + [2025-09-09 18:25:53] iteration 881/ 11920 | consumed samples: 902144 | elapsed time per iteration (ms): 5692.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.668538E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:16.095186 | finish at 2025-09-10 11:53:09 + [2025-09-09 18:25:59] iteration 882/ 11920 | consumed samples: 903168 | elapsed time per iteration (ms): 5695.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.662449E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:44.906745 | finish at 2025-09-10 11:53:44 + [2025-09-09 18:26:05] iteration 883/ 11920 | consumed samples: 904192 | elapsed time per iteration (ms): 5964.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.649907E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:17:06.896670 | finish at 2025-09-10 12:43:12 + [2025-09-09 18:26:11] iteration 884/ 11920 | consumed samples: 905216 | elapsed time per iteration (ms): 5997.3 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.660283E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:23:05.717664 | finish at 2025-09-10 12:49:16 + [2025-09-09 18:26:16] iteration 885/ 11920 | consumed samples: 906240 | elapsed time per iteration (ms): 5693.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 
8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.641571E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:05.584013 | finish at 2025-09-10 11:53:22 + [2025-09-09 18:26:22] iteration 886/ 11920 | consumed samples: 907264 | elapsed time per iteration (ms): 5694.1 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.645419E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:08.251110 | finish at 2025-09-10 11:53:30 + [2025-09-09 18:26:28] iteration 887/ 11920 | consumed samples: 908288 | elapsed time per iteration (ms): 6136.4 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.632617E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:48:22.622197 | finish at 2025-09-10 13:14:51 + [2025-09-09 18:26:34] iteration 888/ 11920 | consumed samples: 909312 | elapsed time per iteration (ms): 5682.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.644224E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:43.639021 | finish at 2025-09-10 11:51:18 + [2025-09-09 18:26:40] iteration 889/ 11920 | consumed samples: 910336 | elapsed time per iteration (ms): 5678.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.611842E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:54.925054 | finish at 2025-09-10 11:50:34 + 
[2025-09-09 18:26:45] iteration 890/ 11920 | consumed samples: 911360 | elapsed time per iteration (ms): 5688.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.625303E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:25:48.706310 | finish at 2025-09-10 11:52:34 + [2025-09-09 18:26:51] iteration 891/ 11920 | consumed samples: 912384 | elapsed time per iteration (ms): 5681.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.623786E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:15.677940 | finish at 2025-09-10 11:51:07 + [2025-09-09 18:26:57] iteration 892/ 11920 | consumed samples: 913408 | elapsed time per iteration (ms): 5880.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.604330E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:48.217083 | finish at 2025-09-10 12:27:45 + [2025-09-09 18:27:02] iteration 893/ 11920 | consumed samples: 914432 | elapsed time per iteration (ms): 5689.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.611845E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:25:34.794422 | finish at 2025-09-10 11:52:37 + [2025-09-09 18:27:08] iteration 894/ 11920 | consumed samples: 915456 | elapsed time per iteration (ms): 5682.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.608448E+00 | loss scale: 1.0 | grad norm: 
0.110 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:16.119130 | finish at 2025-09-10 11:51:24 + [2025-09-09 18:27:14] iteration 895/ 11920 | consumed samples: 916480 | elapsed time per iteration (ms): 5677.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.588866E+00 | loss scale: 1.0 | grad norm: 0.102 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:09.661503 | finish at 2025-09-10 11:50:24 + [2025-09-09 18:27:20] iteration 896/ 11920 | consumed samples: 917504 | elapsed time per iteration (ms): 5676.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.590873E+00 | loss scale: 1.0 | grad norm: 0.106 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:22:51.678612 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:27:25] iteration 897/ 11920 | consumed samples: 918528 | elapsed time per iteration (ms): 5669.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.579874E+00 | loss scale: 1.0 | grad norm: 0.099 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:21:32.902394 | finish at 2025-09-10 11:48:58 + [2025-09-09 18:27:31] iteration 898/ 11920 | consumed samples: 919552 | elapsed time per iteration (ms): 5901.6 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.594748E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:04:07.052884 | finish at 2025-09-10 12:31:38 + [2025-09-09 18:27:37] iteration 899/ 11920 | consumed samples: 920576 | elapsed time per iteration (ms): 5675.9 | 
throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.581721E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:22:34.293405 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:27:42] iteration 900/ 11920 | consumed samples: 921600 | elapsed time per iteration (ms): 5681.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.567956E+00 | loss scale: 1.0 | grad norm: 0.096 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:25.250506 | finish at 2025-09-10 11:51:08 + [2025-09-09 18:27:48] iteration 901/ 11920 | consumed samples: 922624 | elapsed time per iteration (ms): 5673.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.578383E+00 | loss scale: 1.0 | grad norm: 0.093 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:22:00.298298 | finish at 2025-09-10 11:49:48 + [2025-09-09 18:27:54] iteration 902/ 11920 | consumed samples: 923648 | elapsed time per iteration (ms): 5683.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.562394E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:40.490967 | finish at 2025-09-10 11:51:34 + [2025-09-09 18:28:00] iteration 903/ 11920 | consumed samples: 924672 | elapsed time per iteration (ms): 5695.9 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.563369E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
17:25:51.398938 | finish at 2025-09-10 11:53:51 + [2025-09-09 18:28:05] iteration 904/ 11920 | consumed samples: 925696 | elapsed time per iteration (ms): 5685.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.560897E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:47.687553 | finish at 2025-09-10 11:51:53 + [2025-09-09 18:28:11] iteration 905/ 11920 | consumed samples: 926720 | elapsed time per iteration (ms): 5666.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.553888E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:18.368349 | finish at 2025-09-10 11:48:29 + [2025-09-09 18:28:17] iteration 906/ 11920 | consumed samples: 927744 | elapsed time per iteration (ms): 6394.6 | throughput per GPU (TFLOP/s/GPU): 70.6 | MFU 7.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.543564E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:33:50.036803 | finish at 2025-09-10 14:02:07 + [2025-09-09 18:28:23] iteration 907/ 11920 | consumed samples: 928768 | elapsed time per iteration (ms): 5674.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.563470E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:21:29.497863 | finish at 2025-09-10 11:49:52 + [2025-09-09 18:28:29] iteration 908/ 11920 | consumed samples: 929792 | elapsed time per iteration (ms): 5670.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 3.554556E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:38.169488 | finish at 2025-09-10 11:49:07 + [2025-09-09 18:28:34] iteration 909/ 11920 | consumed samples: 930816 | elapsed time per iteration (ms): 5668.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.554465E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:15.593014 | finish at 2025-09-10 11:48:50 + [2025-09-09 18:28:40] iteration 910/ 11920 | consumed samples: 931840 | elapsed time per iteration (ms): 5671.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.561195E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:47.963247 | finish at 2025-09-10 11:49:28 + [2025-09-09 18:28:46] iteration 911/ 11920 | consumed samples: 932864 | elapsed time per iteration (ms): 5675.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.559668E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:21:25.597069 | finish at 2025-09-10 11:50:11 + [2025-09-09 18:28:52] iteration 912/ 11920 | consumed samples: 933888 | elapsed time per iteration (ms): 5899.2 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.549751E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:02:18.603149 | finish at 2025-09-10 12:31:10 + [2025-09-09 18:28:57] iteration 913/ 11920 | consumed samples: 934912 
| elapsed time per iteration (ms): 5672.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.544820E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:31.944677 | finish at 2025-09-10 11:49:29 + [2025-09-09 18:29:03] iteration 914/ 11920 | consumed samples: 935936 | elapsed time per iteration (ms): 5668.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.556789E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:44.408803 | finish at 2025-09-10 11:48:47 + [2025-09-09 18:29:09] iteration 915/ 11920 | consumed samples: 936960 | elapsed time per iteration (ms): 5670.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.546850E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:01.139935 | finish at 2025-09-10 11:49:10 + [2025-09-09 18:29:14] iteration 916/ 11920 | consumed samples: 937984 | elapsed time per iteration (ms): 5669.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.539303E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:41.890145 | finish at 2025-09-10 11:48:56 + [2025-09-09 18:29:20] iteration 917/ 11920 | consumed samples: 939008 | elapsed time per iteration (ms): 5668.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.535441E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 17:19:29.342782 | finish at 2025-09-10 11:48:49 + [2025-09-09 18:29:26] iteration 918/ 11920 | consumed samples: 940032 | elapsed time per iteration (ms): 5665.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.534865E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:46.594512 | finish at 2025-09-10 11:48:12 + [2025-09-09 18:29:31] iteration 919/ 11920 | consumed samples: 941056 | elapsed time per iteration (ms): 5669.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.529329E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:31.044145 | finish at 2025-09-10 11:49:02 + [2025-09-09 18:29:37] iteration 920/ 11920 | consumed samples: 942080 | elapsed time per iteration (ms): 5663.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.541017E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:20.803423 | finish at 2025-09-10 11:47:58 + [2025-09-09 18:29:43] iteration 921/ 11920 | consumed samples: 943104 | elapsed time per iteration (ms): 5671.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.539186E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:39.579897 | finish at 2025-09-10 11:49:22 + [2025-09-09 18:29:48] iteration 922/ 11920 | consumed samples: 944128 | elapsed time per iteration (ms): 5675.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.530271E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:15.555762 | finish at 2025-09-10 11:50:04 + [2025-09-09 18:29:54] iteration 923/ 11920 | consumed samples: 945152 | elapsed time per iteration (ms): 5673.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.538357E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:47.093751 | finish at 2025-09-10 11:49:41 + [2025-09-09 18:30:00] iteration 924/ 11920 | consumed samples: 946176 | elapsed time per iteration (ms): 5894.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.525084E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:17.188446 | finish at 2025-09-10 12:30:17 + [2025-09-09 18:30:05] iteration 925/ 11920 | consumed samples: 947200 | elapsed time per iteration (ms): 5673.8 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.524233E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:43.048182 | finish at 2025-09-10 11:49:48 + [2025-09-09 18:30:11] iteration 926/ 11920 | consumed samples: 948224 | elapsed time per iteration (ms): 5677.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.522747E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:22.980220 | finish at 2025-09-10 11:50:34 + [2025-09-09 18:30:17] 
iteration 927/ 11920 | consumed samples: 949248 | elapsed time per iteration (ms): 5875.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.522405E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:56:30.825721 | finish at 2025-09-10 12:26:48 + [2025-09-09 18:30:23] iteration 928/ 11920 | consumed samples: 950272 | elapsed time per iteration (ms): 5926.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.516819E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:42.021172 | finish at 2025-09-10 12:36:05 + [2025-09-09 18:30:29] iteration 929/ 11920 | consumed samples: 951296 | elapsed time per iteration (ms): 5668.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.514509E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:22.220867 | finish at 2025-09-10 11:48:51 + [2025-09-09 18:30:35] iteration 930/ 11920 | consumed samples: 952320 | elapsed time per iteration (ms): 6009.8 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.506910E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:20:47.761796 | finish at 2025-09-10 12:51:22 + [2025-09-09 18:30:41] iteration 931/ 11920 | consumed samples: 953344 | elapsed time per iteration (ms): 6372.1 | throughput per GPU (TFLOP/s/GPU): 70.9 | MFU 7.16% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.521545E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:27:02.662324 | finish at 2025-09-10 13:57:44 + [2025-09-09 18:30:47] iteration 932/ 11920 | consumed samples: 954368 | elapsed time per iteration (ms): 6017.3 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.515259E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:21:57.727053 | finish at 2025-09-10 12:52:45 + [2025-09-09 18:30:53] iteration 933/ 11920 | consumed samples: 955392 | elapsed time per iteration (ms): 5662.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.512338E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:54.976166 | finish at 2025-09-10 11:47:48 + [2025-09-09 18:30:59] iteration 934/ 11920 | consumed samples: 956416 | elapsed time per iteration (ms): 5942.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.519813E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:03.410468 | finish at 2025-09-10 12:39:02 + [2025-09-09 18:31:05] iteration 935/ 11920 | consumed samples: 957440 | elapsed time per iteration (ms): 6244.7 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.521696E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 19:03:17.520914 | finish at 2025-09-10 13:34:22 + [2025-09-09 18:31:11] iteration 936/ 11920 | consumed samples: 958464 | elapsed time per iteration (ms): 5666.8 | throughput per GPU 
(TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.530729E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:24.157631 | finish at 2025-09-10 11:48:35 + [2025-09-09 18:31:16] iteration 937/ 11920 | consumed samples: 959488 | elapsed time per iteration (ms): 5662.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.513575E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:34.444178 | finish at 2025-09-10 11:47:51 + [2025-09-09 18:31:22] iteration 938/ 11920 | consumed samples: 960512 | elapsed time per iteration (ms): 6022.5 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.515391E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:22:18.833588 | finish at 2025-09-10 12:53:41 + [2025-09-09 18:31:28] iteration 939/ 11920 | consumed samples: 961536 | elapsed time per iteration (ms): 6036.6 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.522365E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:24:47.693776 | finish at 2025-09-10 12:56:16 + [2025-09-09 18:31:34] iteration 940/ 11920 | consumed samples: 962560 | elapsed time per iteration (ms): 5670.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.505221E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:46.252799 | finish at 
2025-09-10 11:49:20 + [2025-09-09 18:31:40] iteration 941/ 11920 | consumed samples: 963584 | elapsed time per iteration (ms): 5661.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.514550E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:58.804491 | finish at 2025-09-10 11:47:38 + [2025-09-09 18:31:45] iteration 942/ 11920 | consumed samples: 964608 | elapsed time per iteration (ms): 5668.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.517357E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:12.974954 | finish at 2025-09-10 11:48:58 + [2025-09-09 18:31:51] iteration 943/ 11920 | consumed samples: 965632 | elapsed time per iteration (ms): 5671.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.496978E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:40.357691 | finish at 2025-09-10 11:49:31 + [2025-09-09 18:31:57] iteration 944/ 11920 | consumed samples: 966656 | elapsed time per iteration (ms): 5664.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.501130E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:17.995445 | finish at 2025-09-10 11:48:15 + [2025-09-09 18:32:02] iteration 945/ 11920 | consumed samples: 967680 | elapsed time per iteration (ms): 5658.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.497929E+00 | loss 
scale: 1.0 | grad norm: 0.169 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:04.190516 | finish at 2025-09-10 11:47:06 + [2025-09-09 18:32:08] iteration 946/ 11920 | consumed samples: 968704 | elapsed time per iteration (ms): 5660.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.514772E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:15.755618 | finish at 2025-09-10 11:47:24 + [2025-09-09 18:32:14] iteration 947/ 11920 | consumed samples: 969728 | elapsed time per iteration (ms): 5667.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.514241E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:28.773960 | finish at 2025-09-10 11:48:42 + [2025-09-09 18:32:19] iteration 948/ 11920 | consumed samples: 970752 | elapsed time per iteration (ms): 5665.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.519151E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:56.201698 | finish at 2025-09-10 11:48:15 + [2025-09-09 18:32:25] iteration 949/ 11920 | consumed samples: 971776 | elapsed time per iteration (ms): 5660.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.513297E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:57.715466 | finish at 2025-09-10 11:47:23 + [2025-09-09 18:32:31] iteration 950/ 11920 | consumed samples: 972800 | elapsed time per 
iteration (ms): 5668.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.501807E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:23.535955 | finish at 2025-09-10 11:48:54 + [2025-09-09 18:32:36] iteration 951/ 11920 | consumed samples: 973824 | elapsed time per iteration (ms): 5661.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.511921E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:00.566969 | finish at 2025-09-10 11:47:37 + [2025-09-09 18:32:42] iteration 952/ 11920 | consumed samples: 974848 | elapsed time per iteration (ms): 5664.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.497649E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:24.075554 | finish at 2025-09-10 11:48:06 + [2025-09-09 18:32:48] iteration 953/ 11920 | consumed samples: 975872 | elapsed time per iteration (ms): 5670.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.492660E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:26.836474 | finish at 2025-09-10 11:49:14 + [2025-09-09 18:32:53] iteration 954/ 11920 | consumed samples: 976896 | elapsed time per iteration (ms): 5659.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.499878E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:14:25.194820 | finish at 2025-09-10 11:47:18 + [2025-09-09 18:32:59] iteration 955/ 11920 | consumed samples: 977920 | elapsed time per iteration (ms): 5663.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.489438E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:04.887214 | finish at 2025-09-10 11:48:04 + [2025-09-09 18:33:05] iteration 956/ 11920 | consumed samples: 978944 | elapsed time per iteration (ms): 5661.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.493452E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:30.301762 | finish at 2025-09-10 11:47:35 + [2025-09-09 18:33:10] iteration 957/ 11920 | consumed samples: 979968 | elapsed time per iteration (ms): 5663.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.487529E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:44.609780 | finish at 2025-09-10 11:47:55 + [2025-09-09 18:33:16] iteration 958/ 11920 | consumed samples: 980992 | elapsed time per iteration (ms): 5661.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.484801E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:21.362749 | finish at 2025-09-10 11:47:37 + [2025-09-09 18:33:22] iteration 959/ 11920 | consumed samples: 982016 | elapsed time per iteration (ms): 5657.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 3.497515E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:31.844747 | finish at 2025-09-10 11:46:53 + [2025-09-09 18:33:27] iteration 960/ 11920 | consumed samples: 983040 | elapsed time per iteration (ms): 5655.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.475608E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:04.229641 | finish at 2025-09-10 11:46:31 + [2025-09-09 18:33:33] iteration 961/ 11920 | consumed samples: 984064 | elapsed time per iteration (ms): 5662.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.489323E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:16.376361 | finish at 2025-09-10 11:47:49 + [2025-09-09 18:33:39] iteration 962/ 11920 | consumed samples: 985088 | elapsed time per iteration (ms): 5664.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.484225E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:34.049427 | finish at 2025-09-10 11:48:13 + [2025-09-09 18:33:44] iteration 963/ 11920 | consumed samples: 986112 | elapsed time per iteration (ms): 5684.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.486855E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:08.880305 | finish at 2025-09-10 11:51:53 + [2025-09-09 18:33:50] iteration 964/ 11920 | consumed 
samples: 987136 | elapsed time per iteration (ms): 5667.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.515860E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:50.622580 | finish at 2025-09-10 11:48:40 + [2025-09-09 18:33:56] iteration 965/ 11920 | consumed samples: 988160 | elapsed time per iteration (ms): 5877.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.476319E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:04.792684 | finish at 2025-09-10 12:27:01 + [2025-09-09 18:34:02] iteration 966/ 11920 | consumed samples: 989184 | elapsed time per iteration (ms): 6196.4 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.497197E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:51:15.205492 | finish at 2025-09-10 13:25:17 + [2025-09-09 18:34:08] iteration 967/ 11920 | consumed samples: 990208 | elapsed time per iteration (ms): 5862.3 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.479923E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:10.314185 | finish at 2025-09-10 12:24:18 + [2025-09-09 18:34:13] iteration 968/ 11920 | consumed samples: 991232 | elapsed time per iteration (ms): 5656.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.493342E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:30.772455 | finish at 2025-09-10 11:46:44 + [2025-09-09 18:34:19] iteration 969/ 11920 | consumed samples: 992256 | elapsed time per iteration (ms): 5661.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.479854E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:23.480431 | finish at 2025-09-10 11:47:43 + [2025-09-09 18:34:25] iteration 970/ 11920 | consumed samples: 993280 | elapsed time per iteration (ms): 5660.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.471801E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:05.798943 | finish at 2025-09-10 11:47:31 + [2025-09-09 18:34:30] iteration 971/ 11920 | consumed samples: 994304 | elapsed time per iteration (ms): 5659.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.461316E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:49.119451 | finish at 2025-09-10 11:47:20 + [2025-09-09 18:34:36] iteration 972/ 11920 | consumed samples: 995328 | elapsed time per iteration (ms): 5655.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.475113E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:59.347162 | finish at 2025-09-10 11:46:35 + [2025-09-09 18:34:42] iteration 973/ 11920 | consumed samples: 996352 | elapsed time per iteration (ms): 5664.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 
8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.473492E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:29.190130 | finish at 2025-09-10 11:48:11 + [2025-09-09 18:34:47] iteration 974/ 11920 | consumed samples: 997376 | elapsed time per iteration (ms): 5674.8 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.474856E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:16.641766 | finish at 2025-09-10 11:50:04 + [2025-09-09 18:34:53] iteration 975/ 11920 | consumed samples: 998400 | elapsed time per iteration (ms): 5663.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.472636E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:08.785336 | finish at 2025-09-10 11:48:02 + [2025-09-09 18:34:59] iteration 976/ 11920 | consumed samples: 999424 | elapsed time per iteration (ms): 5671.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.483339E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:27.914566 | finish at 2025-09-10 11:49:27 + [2025-09-09 18:35:04] iteration 977/ 11920 | consumed samples: 1000448 | elapsed time per iteration (ms): 5664.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.470880E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:08.681992 | finish at 2025-09-10 11:48:13 + 
[2025-09-09 18:35:10] iteration 978/ 11920 | consumed samples: 1001472 | elapsed time per iteration (ms): 5664.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.462506E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:59.012832 | finish at 2025-09-10 11:48:09 + [2025-09-09 18:35:16] iteration 979/ 11920 | consumed samples: 1002496 | elapsed time per iteration (ms): 5672.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.484738E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:20.424106 | finish at 2025-09-10 11:49:36 + [2025-09-09 18:35:21] iteration 980/ 11920 | consumed samples: 1003520 | elapsed time per iteration (ms): 5670.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.507717E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:51.830091 | finish at 2025-09-10 11:49:13 + [2025-09-09 18:35:27] iteration 981/ 11920 | consumed samples: 1004544 | elapsed time per iteration (ms): 5666.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.491039E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:05.990553 | finish at 2025-09-10 11:48:33 + [2025-09-09 18:35:33] iteration 982/ 11920 | consumed samples: 1005568 | elapsed time per iteration (ms): 5660.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.497978E+00 | loss scale: 1.0 | grad 
norm: 0.260 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:52.857066 | finish at 2025-09-10 11:47:26 + [2025-09-09 18:35:38] iteration 983/ 11920 | consumed samples: 1006592 | elapsed time per iteration (ms): 5661.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.463660E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:04.211206 | finish at 2025-09-10 11:47:43 + [2025-09-09 18:35:44] iteration 984/ 11920 | consumed samples: 1007616 | elapsed time per iteration (ms): 5984.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.471287E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:10:50.725578 | finish at 2025-09-10 12:46:35 + [2025-09-09 18:35:50] iteration 985/ 11920 | consumed samples: 1008640 | elapsed time per iteration (ms): 5679.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.466557E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:59.407672 | finish at 2025-09-10 11:50:50 + [2025-09-09 18:35:56] iteration 986/ 11920 | consumed samples: 1009664 | elapsed time per iteration (ms): 5671.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.472449E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:32.248424 | finish at 2025-09-10 11:49:28 + [2025-09-09 18:36:01] iteration 987/ 11920 | consumed samples: 1010688 | elapsed time per iteration (ms): 
5659.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.454374E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:14.412941 | finish at 2025-09-10 11:47:16 + [2025-09-09 18:36:07] iteration 988/ 11920 | consumed samples: 1011712 | elapsed time per iteration (ms): 5655.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.469458E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:20.298091 | finish at 2025-09-10 11:46:27 + [2025-09-09 18:36:13] iteration 989/ 11920 | consumed samples: 1012736 | elapsed time per iteration (ms): 5655.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.471783E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:17.158044 | finish at 2025-09-10 11:46:30 + [2025-09-09 18:36:18] iteration 990/ 11920 | consumed samples: 1013760 | elapsed time per iteration (ms): 5662.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.449558E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:28.708274 | finish at 2025-09-10 11:47:47 + [2025-09-09 18:36:24] iteration 991/ 11920 | consumed samples: 1014784 | elapsed time per iteration (ms): 5657.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.461506E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 17:10:30.210690 | finish at 2025-09-10 11:46:54 + [2025-09-09 18:36:30] iteration 992/ 11920 | consumed samples: 1015808 | elapsed time per iteration (ms): 5654.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.448185E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:56.161785 | finish at 2025-09-10 11:46:26 + [2025-09-09 18:36:35] iteration 993/ 11920 | consumed samples: 1016832 | elapsed time per iteration (ms): 5659.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.447203E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:43.395100 | finish at 2025-09-10 11:47:19 + [2025-09-09 18:36:41] iteration 994/ 11920 | consumed samples: 1017856 | elapsed time per iteration (ms): 5885.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.455232E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:42.867691 | finish at 2025-09-10 12:28:24 + [2025-09-09 18:36:47] iteration 995/ 11920 | consumed samples: 1018880 | elapsed time per iteration (ms): 5670.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.448826E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:24.383568 | finish at 2025-09-10 11:49:11 + [2025-09-09 18:36:53] iteration 996/ 11920 | consumed samples: 1019904 | elapsed time per iteration (ms): 5660.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 3.446385E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:31.601569 | finish at 2025-09-10 11:47:24 + [2025-09-09 18:36:58] iteration 997/ 11920 | consumed samples: 1020928 | elapsed time per iteration (ms): 5655.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.457445E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:33.575227 | finish at 2025-09-10 11:46:32 + [2025-09-09 18:37:04] iteration 998/ 11920 | consumed samples: 1021952 | elapsed time per iteration (ms): 5666.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.465917E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:27.670362 | finish at 2025-09-10 11:48:32 + [2025-09-09 18:37:10] iteration 999/ 11920 | consumed samples: 1022976 | elapsed time per iteration (ms): 5661.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.466580E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:30.110909 | finish at 2025-09-10 11:47:40 + [2025-09-09 18:37:15] iteration 1000/ 11920 | consumed samples: 1024000 | elapsed time per iteration (ms): 5658.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.466210E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:48.075399 | finish at 2025-09-10 11:47:03 + [2025-09-09 18:37:21] iteration 1001/ 11920 | 
consumed samples: 1025024 | elapsed time per iteration (ms): 5667.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.461389E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:22.287260 | finish at 2025-09-10 11:48:43 + [2025-09-09 18:37:27] iteration 1002/ 11920 | consumed samples: 1026048 | elapsed time per iteration (ms): 5670.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.441373E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:48.876910 | finish at 2025-09-10 11:49:15 + [2025-09-09 18:37:32] iteration 1003/ 11920 | consumed samples: 1027072 | elapsed time per iteration (ms): 5667.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.450948E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:16.277830 | finish at 2025-09-10 11:48:49 + [2025-09-09 18:37:38] iteration 1004/ 11920 | consumed samples: 1028096 | elapsed time per iteration (ms): 5675.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.444153E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:30.285246 | finish at 2025-09-10 11:50:08 + [2025-09-09 18:37:44] iteration 1005/ 11920 | consumed samples: 1029120 | elapsed time per iteration (ms): 5673.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.449611E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:00.296413 | finish at 2025-09-10 11:49:44 + [2025-09-09 18:37:49] iteration 1006/ 11920 | consumed samples: 1030144 | elapsed time per iteration (ms): 5670.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427066E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:23.504940 | finish at 2025-09-10 11:49:13 + [2025-09-09 18:37:55] iteration 1007/ 11920 | consumed samples: 1031168 | elapsed time per iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.440042E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:00.600692 | finish at 2025-09-10 11:45:56 + [2025-09-09 18:38:01] iteration 1008/ 11920 | consumed samples: 1032192 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.432381E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:15.985390 | finish at 2025-09-10 11:46:17 + [2025-09-09 18:38:06] iteration 1009/ 11920 | consumed samples: 1033216 | elapsed time per iteration (ms): 5645.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.449817E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:42.147078 | finish at 2025-09-10 11:44:48 + [2025-09-09 18:38:12] iteration 1010/ 11920 | consumed samples: 1034240 | elapsed time per iteration (ms): 5652.4 | throughput per GPU 
(TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.453420E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:47.637362 | finish at 2025-09-10 11:45:59 + [2025-09-09 18:38:18] iteration 1011/ 11920 | consumed samples: 1035264 | elapsed time per iteration (ms): 5899.1 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.444281E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:33.040727 | finish at 2025-09-10 12:30:51 + [2025-09-09 18:38:23] iteration 1012/ 11920 | consumed samples: 1036288 | elapsed time per iteration (ms): 5651.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.429078E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:21.701202 | finish at 2025-09-10 11:45:45 + [2025-09-09 18:38:29] iteration 1013/ 11920 | consumed samples: 1037312 | elapsed time per iteration (ms): 5987.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.436807E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:08:22.614274 | finish at 2025-09-10 12:46:52 + [2025-09-09 18:38:35] iteration 1014/ 11920 | consumed samples: 1038336 | elapsed time per iteration (ms): 5652.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.439435E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:26.933721 | 
finish at 2025-09-10 11:46:02 + [2025-09-09 18:38:41] iteration 1015/ 11920 | consumed samples: 1039360 | elapsed time per iteration (ms): 5652.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.447832E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:21.101754 | finish at 2025-09-10 11:46:02 + [2025-09-09 18:38:46] iteration 1016/ 11920 | consumed samples: 1040384 | elapsed time per iteration (ms): 5670.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.436126E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:30.568298 | finish at 2025-09-10 11:49:17 + [2025-09-09 18:38:52] iteration 1017/ 11920 | consumed samples: 1041408 | elapsed time per iteration (ms): 5985.4 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427082E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:39.171908 | finish at 2025-09-10 12:46:32 + [2025-09-09 18:38:58] iteration 1018/ 11920 | consumed samples: 1042432 | elapsed time per iteration (ms): 5650.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.449011E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:41.975179 | finish at 2025-09-10 11:45:40 + [2025-09-09 18:39:04] iteration 1019/ 11920 | consumed samples: 1043456 | elapsed time per iteration (ms): 5955.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.451210E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:01:58.871855 | finish at 2025-09-10 12:41:03 + [2025-09-09 18:39:10] iteration 1020/ 11920 | consumed samples: 1044480 | elapsed time per iteration (ms): 5652.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.435096E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:50.011530 | finish at 2025-09-10 11:46:00 + [2025-09-09 18:39:15] iteration 1021/ 11920 | consumed samples: 1045504 | elapsed time per iteration (ms): 5650.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.455391E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:20.190363 | finish at 2025-09-10 11:45:35 + [2025-09-09 18:39:21] iteration 1022/ 11920 | consumed samples: 1046528 | elapsed time per iteration (ms): 5650.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427055E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:17.619254 | finish at 2025-09-10 11:45:39 + [2025-09-09 18:39:27] iteration 1023/ 11920 | consumed samples: 1047552 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.428940E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:14.501991 | finish at 2025-09-10 11:45:41 + [2025-09-09 18:39:32] iteration 1024/ 11920 | consumed samples: 
1048576 | elapsed time per iteration (ms): 5662.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427331E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:16.489540 | finish at 2025-09-10 11:47:49 + [2025-09-09 18:39:38] iteration 1025/ 11920 | consumed samples: 1049600 | elapsed time per iteration (ms): 6012.5 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.422407E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:11:46.247764 | finish at 2025-09-10 12:51:24 + [2025-09-09 18:39:44] iteration 1026/ 11920 | consumed samples: 1050624 | elapsed time per iteration (ms): 5651.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.419785E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:09.209638 | finish at 2025-09-10 11:45:53 + [2025-09-09 18:39:50] iteration 1027/ 11920 | consumed samples: 1051648 | elapsed time per iteration (ms): 5654.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.437998E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:32.287024 | finish at 2025-09-10 11:46:22 + [2025-09-09 18:39:55] iteration 1028/ 11920 | consumed samples: 1052672 | elapsed time per iteration (ms): 5644.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.431857E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:44.695772 | finish at 2025-09-10 11:44:40 + [2025-09-09 18:40:01] iteration 1029/ 11920 | consumed samples: 1053696 | elapsed time per iteration (ms): 5645.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427740E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:46.710851 | finish at 2025-09-10 11:44:48 + [2025-09-09 18:40:06] iteration 1030/ 11920 | consumed samples: 1054720 | elapsed time per iteration (ms): 5651.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.430972E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:42.038555 | finish at 2025-09-10 11:45:49 + [2025-09-09 18:40:12] iteration 1031/ 11920 | consumed samples: 1055744 | elapsed time per iteration (ms): 5860.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.441218E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:36.854780 | finish at 2025-09-10 12:23:49 + [2025-09-09 18:40:18] iteration 1032/ 11920 | consumed samples: 1056768 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.436763E+00 | loss scale: 1.0 | grad norm: 0.293 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:15.580458 | finish at 2025-09-10 11:43:34 + [2025-09-09 18:40:24] iteration 1033/ 11920 | consumed samples: 1057792 | elapsed time per iteration (ms): 5901.7 | throughput per GPU (TFLOP/s/GPU): 76.5 
| MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.440572E+00 | loss scale: 1.0 | grad norm: 0.312 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:51.478467 | finish at 2025-09-10 12:31:15 + [2025-09-09 18:40:30] iteration 1034/ 11920 | consumed samples: 1058816 | elapsed time per iteration (ms): 5643.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.447301E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:59.990229 | finish at 2025-09-10 11:44:30 + [2025-09-09 18:40:35] iteration 1035/ 11920 | consumed samples: 1059840 | elapsed time per iteration (ms): 5941.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.444778E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:48.843670 | finish at 2025-09-10 12:38:24 + [2025-09-09 18:40:41] iteration 1036/ 11920 | consumed samples: 1060864 | elapsed time per iteration (ms): 5647.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.430567E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:26.414715 | finish at 2025-09-10 11:45:08 + [2025-09-09 18:40:47] iteration 1037/ 11920 | consumed samples: 1061888 | elapsed time per iteration (ms): 5653.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.423703E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:29.135302 | finish at 2025-09-10 
11:46:16 + [2025-09-09 18:40:53] iteration 1038/ 11920 | consumed samples: 1062912 | elapsed time per iteration (ms): 5865.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.439201E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:49.631622 | finish at 2025-09-10 12:24:42 + [2025-09-09 18:40:58] iteration 1039/ 11920 | consumed samples: 1063936 | elapsed time per iteration (ms): 5651.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.426493E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:48.546814 | finish at 2025-09-10 11:45:47 + [2025-09-09 18:41:04] iteration 1040/ 11920 | consumed samples: 1064960 | elapsed time per iteration (ms): 5656.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.437371E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:43.317719 | finish at 2025-09-10 11:46:47 + [2025-09-09 18:41:10] iteration 1041/ 11920 | consumed samples: 1065984 | elapsed time per iteration (ms): 5643.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.436228E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:11.082840 | finish at 2025-09-10 11:44:21 + [2025-09-09 18:41:15] iteration 1042/ 11920 | consumed samples: 1067008 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.419698E+00 | loss 
scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:45.542294 | finish at 2025-09-10 11:44:01 + [2025-09-09 18:41:21] iteration 1043/ 11920 | consumed samples: 1068032 | elapsed time per iteration (ms): 5867.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.415404E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:38.902539 | finish at 2025-09-10 12:25:00 + [2025-09-09 18:41:27] iteration 1044/ 11920 | consumed samples: 1069056 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.416187E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:31.959763 | finish at 2025-09-10 11:43:59 + [2025-09-09 18:41:33] iteration 1045/ 11920 | consumed samples: 1070080 | elapsed time per iteration (ms): 5959.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.415134E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:07.507217 | finish at 2025-09-10 12:41:40 + [2025-09-09 18:41:38] iteration 1046/ 11920 | consumed samples: 1071104 | elapsed time per iteration (ms): 5655.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.426872E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:00.021827 | finish at 2025-09-10 11:46:38 + [2025-09-09 18:41:44] iteration 1047/ 11920 | consumed samples: 1072128 | elapsed time 
per iteration (ms): 5665.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.426182E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:42.759026 | finish at 2025-09-10 11:48:27 + [2025-09-09 18:41:50] iteration 1048/ 11920 | consumed samples: 1073152 | elapsed time per iteration (ms): 5893.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.416283E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:47:52.466263 | finish at 2025-09-10 12:29:42 + [2025-09-09 18:41:56] iteration 1049/ 11920 | consumed samples: 1074176 | elapsed time per iteration (ms): 5651.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.413110E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:52.772885 | finish at 2025-09-10 11:45:48 + [2025-09-09 18:42:01] iteration 1050/ 11920 | consumed samples: 1075200 | elapsed time per iteration (ms): 5655.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.415676E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:38.101375 | finish at 2025-09-10 11:46:39 + [2025-09-09 18:42:07] iteration 1051/ 11920 | consumed samples: 1076224 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.437711E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 17:03:18.130263 | finish at 2025-09-10 11:45:25 + [2025-09-09 18:42:13] iteration 1052/ 11920 | consumed samples: 1077248 | elapsed time per iteration (ms): 5653.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.407506E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:57.530782 | finish at 2025-09-10 11:46:10 + [2025-09-09 18:42:18] iteration 1053/ 11920 | consumed samples: 1078272 | elapsed time per iteration (ms): 5647.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.420803E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:50.856963 | finish at 2025-09-10 11:45:09 + [2025-09-09 18:42:24] iteration 1054/ 11920 | consumed samples: 1079296 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418963E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:56.325617 | finish at 2025-09-10 11:46:20 + [2025-09-09 18:42:29] iteration 1055/ 11920 | consumed samples: 1080320 | elapsed time per iteration (ms): 5638.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.424642E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:03.802083 | finish at 2025-09-10 11:43:33 + [2025-09-09 18:42:36] iteration 1056/ 11920 | consumed samples: 1081344 | elapsed time per iteration (ms): 6041.3 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.400367E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:13:52.603539 | finish at 2025-09-10 12:56:28 + [2025-09-09 18:42:41] iteration 1057/ 11920 | consumed samples: 1082368 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.434670E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:33.943143 | finish at 2025-09-10 11:44:15 + [2025-09-09 18:42:47] iteration 1058/ 11920 | consumed samples: 1083392 | elapsed time per iteration (ms): 5658.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.402469E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:04:23.885119 | finish at 2025-09-10 11:47:11 + [2025-09-09 18:42:52] iteration 1059/ 11920 | consumed samples: 1084416 | elapsed time per iteration (ms): 5643.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.410682E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:33.829190 | finish at 2025-09-10 11:44:26 + [2025-09-09 18:42:58] iteration 1060/ 11920 | consumed samples: 1085440 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.396012E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:19.535108 | finish at 2025-09-10 11:44:18 + [2025-09-09 
18:43:04] iteration 1061/ 11920 | consumed samples: 1086464 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.399000E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:40.934616 | finish at 2025-09-10 11:43:45 + [2025-09-09 18:43:09] iteration 1062/ 11920 | consumed samples: 1087488 | elapsed time per iteration (ms): 5645.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.409839E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:35.167553 | finish at 2025-09-10 11:44:45 + [2025-09-09 18:43:15] iteration 1063/ 11920 | consumed samples: 1088512 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.396602E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:56.609480 | finish at 2025-09-10 11:44:12 + [2025-09-09 18:43:21] iteration 1064/ 11920 | consumed samples: 1089536 | elapsed time per iteration (ms): 5647.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.406853E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:51.659742 | finish at 2025-09-10 11:45:12 + [2025-09-09 18:43:26] iteration 1065/ 11920 | consumed samples: 1090560 | elapsed time per iteration (ms): 5643.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.420272E+00 | loss scale: 1.0 | grad norm: 
0.246 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:59.062502 | finish at 2025-09-10 11:44:25 + [2025-09-09 18:43:32] iteration 1066/ 11920 | consumed samples: 1091584 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.426815E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:27.906033 | finish at 2025-09-10 11:44:00 + [2025-09-09 18:43:38] iteration 1067/ 11920 | consumed samples: 1092608 | elapsed time per iteration (ms): 5647.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418426E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:36.282051 | finish at 2025-09-10 11:45:14 + [2025-09-09 18:43:43] iteration 1068/ 11920 | consumed samples: 1093632 | elapsed time per iteration (ms): 5651.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.424881E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:11.594024 | finish at 2025-09-10 11:45:55 + [2025-09-09 18:43:49] iteration 1069/ 11920 | consumed samples: 1094656 | elapsed time per iteration (ms): 5661.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.440554E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:57.233392 | finish at 2025-09-10 11:47:46 + [2025-09-09 18:43:55] iteration 1070/ 11920 | consumed samples: 1095680 | elapsed time per iteration (ms): 
5646.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.425295E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:08.885028 | finish at 2025-09-10 11:45:03 + [2025-09-09 18:44:00] iteration 1071/ 11920 | consumed samples: 1096704 | elapsed time per iteration (ms): 5649.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.414948E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:28.961895 | finish at 2025-09-10 11:45:29 + [2025-09-09 18:44:06] iteration 1072/ 11920 | consumed samples: 1097728 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.425969E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:37.907478 | finish at 2025-09-10 11:45:44 + [2025-09-09 18:44:12] iteration 1073/ 11920 | consumed samples: 1098752 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.414940E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:36.823242 | finish at 2025-09-10 11:44:48 + [2025-09-09 18:44:17] iteration 1074/ 11920 | consumed samples: 1099776 | elapsed time per iteration (ms): 5646.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418958E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 17:00:46.429301 | finish at 2025-09-10 11:45:04 + [2025-09-09 18:44:23] iteration 1075/ 11920 | consumed samples: 1100800 | elapsed time per iteration (ms): 5641.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.401181E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:39.238758 | finish at 2025-09-10 11:44:02 + [2025-09-09 18:44:28] iteration 1076/ 11920 | consumed samples: 1101824 | elapsed time per iteration (ms): 5645.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.424318E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:22.138612 | finish at 2025-09-10 11:44:51 + [2025-09-09 18:44:35] iteration 1077/ 11920 | consumed samples: 1102848 | elapsed time per iteration (ms): 6158.7 | throughput per GPU (TFLOP/s/GPU): 73.3 | MFU 7.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.405835E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:32:58.920998 | finish at 2025-09-10 13:17:34 + [2025-09-09 18:44:40] iteration 1078/ 11920 | consumed samples: 1103872 | elapsed time per iteration (ms): 5839.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.409515E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:07.841475 | finish at 2025-09-10 12:19:48 + [2025-09-09 18:44:46] iteration 1079/ 11920 | consumed samples: 1104896 | elapsed time per iteration (ms): 5859.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.408979E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:37.746905 | finish at 2025-09-10 12:23:24 + [2025-09-09 18:44:52] iteration 1080/ 11920 | consumed samples: 1105920 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.399320E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:43.227148 | finish at 2025-09-10 11:44:35 + [2025-09-09 18:44:58] iteration 1081/ 11920 | consumed samples: 1106944 | elapsed time per iteration (ms): 5649.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.393970E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:35.722697 | finish at 2025-09-10 11:45:33 + [2025-09-09 18:45:03] iteration 1082/ 11920 | consumed samples: 1107968 | elapsed time per iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.393104E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:56.261767 | finish at 2025-09-10 11:46:00 + [2025-09-09 18:45:09] iteration 1083/ 11920 | consumed samples: 1108992 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.393053E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:54.127271 | finish at 2025-09-10 11:43:03 + [2025-09-09 18:45:15] iteration 
1084/ 11920 | consumed samples: 1110016 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.378336E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:07.451900 | finish at 2025-09-10 11:43:22 + [2025-09-09 18:45:20] iteration 1085/ 11920 | consumed samples: 1111040 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.397531E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:28.495914 | finish at 2025-09-10 11:44:49 + [2025-09-09 18:45:26] iteration 1086/ 11920 | consumed samples: 1112064 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.393920E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:15.782166 | finish at 2025-09-10 11:43:42 + [2025-09-09 18:45:31] iteration 1087/ 11920 | consumed samples: 1113088 | elapsed time per iteration (ms): 5644.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.388840E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:11.489295 | finish at 2025-09-10 11:44:43 + [2025-09-09 18:45:37] iteration 1088/ 11920 | consumed samples: 1114112 | elapsed time per iteration (ms): 5643.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.387095E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:51.751392 | finish at 2025-09-10 11:44:29 + [2025-09-09 18:45:43] iteration 1089/ 11920 | consumed samples: 1115136 | elapsed time per iteration (ms): 5850.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.391659E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:08.540705 | finish at 2025-09-10 12:21:51 + [2025-09-09 18:45:49] iteration 1090/ 11920 | consumed samples: 1116160 | elapsed time per iteration (ms): 6180.1 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.398080E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:35:30.913818 | finish at 2025-09-10 13:21:20 + [2025-09-09 18:45:55] iteration 1091/ 11920 | consumed samples: 1117184 | elapsed time per iteration (ms): 5983.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384718E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:53.818913 | finish at 2025-09-10 12:45:49 + [2025-09-09 18:46:01] iteration 1092/ 11920 | consumed samples: 1118208 | elapsed time per iteration (ms): 5860.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.385544E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:32.938536 | finish at 2025-09-10 12:23:34 + [2025-09-09 18:46:07] iteration 1093/ 11920 | consumed samples: 1119232 | elapsed time per iteration (ms): 6052.4 | throughput per 
GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.381192E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:12:09.627708 | finish at 2025-09-10 12:58:17 + [2025-09-09 18:46:13] iteration 1094/ 11920 | consumed samples: 1120256 | elapsed time per iteration (ms): 6269.5 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.396227E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:51:13.746566 | finish at 2025-09-10 13:37:27 + [2025-09-09 18:46:19] iteration 1095/ 11920 | consumed samples: 1121280 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.381228E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:06.190943 | finish at 2025-09-10 11:43:25 + [2025-09-09 18:46:25] iteration 1096/ 11920 | consumed samples: 1122304 | elapsed time per iteration (ms): 5955.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.394635E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:17.659819 | finish at 2025-09-10 12:40:43 + [2025-09-09 18:46:31] iteration 1097/ 11920 | consumed samples: 1123328 | elapsed time per iteration (ms): 5894.9 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.398818E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:43:20.508765 
| finish at 2025-09-10 12:29:51 + [2025-09-09 18:46:36] iteration 1098/ 11920 | consumed samples: 1124352 | elapsed time per iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.415713E+00 | loss scale: 1.0 | grad norm: 0.343 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:25.442911 | finish at 2025-09-10 11:46:02 + [2025-09-09 18:46:42] iteration 1099/ 11920 | consumed samples: 1125376 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.412466E+00 | loss scale: 1.0 | grad norm: 0.356 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:18.860811 | finish at 2025-09-10 11:45:01 + [2025-09-09 18:46:48] iteration 1100/ 11920 | consumed samples: 1126400 | elapsed time per iteration (ms): 5661.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.444155E+00 | loss scale: 1.0 | grad norm: 0.375 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:52.564459 | finish at 2025-09-10 11:47:40 + [2025-09-09 18:46:53] iteration 1101/ 11920 | consumed samples: 1127424 | elapsed time per iteration (ms): 5649.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.434934E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:41.926447 | finish at 2025-09-10 11:45:35 + [2025-09-09 18:46:59] iteration 1102/ 11920 | consumed samples: 1128448 | elapsed time per iteration (ms): 5656.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.459544E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:48.280815 | finish at 2025-09-10 11:46:47 + [2025-09-09 18:47:05] iteration 1103/ 11920 | consumed samples: 1129472 | elapsed time per iteration (ms): 5671.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.449939E+00 | loss scale: 1.0 | grad norm: 0.362 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:28.798256 | finish at 2025-09-10 11:49:34 + [2025-09-09 18:47:10] iteration 1104/ 11920 | consumed samples: 1130496 | elapsed time per iteration (ms): 5659.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.454725E+00 | loss scale: 1.0 | grad norm: 0.334 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:13.264206 | finish at 2025-09-10 11:47:24 + [2025-09-09 18:47:16] iteration 1105/ 11920 | consumed samples: 1131520 | elapsed time per iteration (ms): 5655.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.451081E+00 | loss scale: 1.0 | grad norm: 0.351 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:25.608716 | finish at 2025-09-10 11:46:42 + [2025-09-09 18:47:22] iteration 1106/ 11920 | consumed samples: 1132544 | elapsed time per iteration (ms): 5642.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.466657E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:56.514255 | finish at 2025-09-10 11:44:18 + [2025-09-09 18:47:27] iteration 1107/ 11920 | consumed samples: 
1133568 | elapsed time per iteration (ms): 5660.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.450968E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:03.692327 | finish at 2025-09-10 11:47:31 + [2025-09-09 18:47:33] iteration 1108/ 11920 | consumed samples: 1134592 | elapsed time per iteration (ms): 5653.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.432825E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:47.532380 | finish at 2025-09-10 11:46:21 + [2025-09-09 18:47:39] iteration 1109/ 11920 | consumed samples: 1135616 | elapsed time per iteration (ms): 5658.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.455132E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:38.056261 | finish at 2025-09-10 11:47:17 + [2025-09-09 18:47:44] iteration 1110/ 11920 | consumed samples: 1136640 | elapsed time per iteration (ms): 5661.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.438375E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:59.441049 | finish at 2025-09-10 11:47:44 + [2025-09-09 18:47:50] iteration 1111/ 11920 | consumed samples: 1137664 | elapsed time per iteration (ms): 5658.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427875E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:27.192082 | finish at 2025-09-10 11:47:17 + [2025-09-09 18:47:56] iteration 1112/ 11920 | consumed samples: 1138688 | elapsed time per iteration (ms): 5648.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.430302E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:32.450886 | finish at 2025-09-10 11:45:28 + [2025-09-09 18:48:01] iteration 1113/ 11920 | consumed samples: 1139712 | elapsed time per iteration (ms): 5655.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.409825E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:34.130928 | finish at 2025-09-10 11:46:35 + [2025-09-09 18:48:07] iteration 1114/ 11920 | consumed samples: 1140736 | elapsed time per iteration (ms): 5652.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.416342E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:59.249750 | finish at 2025-09-10 11:46:06 + [2025-09-09 18:48:13] iteration 1115/ 11920 | consumed samples: 1141760 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.395257E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:22.866471 | finish at 2025-09-10 11:43:35 + [2025-09-09 18:48:18] iteration 1116/ 11920 | consumed samples: 1142784 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 
| MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.405853E+00 | loss scale: 1.0 | grad norm: 0.100 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:37.410050 | finish at 2025-09-10 11:43:56 + [2025-09-09 18:48:24] iteration 1117/ 11920 | consumed samples: 1143808 | elapsed time per iteration (ms): 5647.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.389441E+00 | loss scale: 1.0 | grad norm: 0.094 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:49.932610 | finish at 2025-09-10 11:45:14 + [2025-09-09 18:48:30] iteration 1118/ 11920 | consumed samples: 1144832 | elapsed time per iteration (ms): 5644.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.408835E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:09.579054 | finish at 2025-09-10 11:44:39 + [2025-09-09 18:48:35] iteration 1119/ 11920 | consumed samples: 1145856 | elapsed time per iteration (ms): 5645.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.386879E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:21.105928 | finish at 2025-09-10 11:44:56 + [2025-09-09 18:48:41] iteration 1120/ 11920 | consumed samples: 1146880 | elapsed time per iteration (ms): 5649.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.381911E+00 | loss scale: 1.0 | grad norm: 0.082 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:58.162537 | finish at 2025-09-10 
11:45:39 + [2025-09-09 18:48:46] iteration 1121/ 11920 | consumed samples: 1147904 | elapsed time per iteration (ms): 5646.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.395724E+00 | loss scale: 1.0 | grad norm: 0.076 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:21.037216 | finish at 2025-09-10 11:45:07 + [2025-09-09 18:48:52] iteration 1122/ 11920 | consumed samples: 1148928 | elapsed time per iteration (ms): 5649.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384601E+00 | loss scale: 1.0 | grad norm: 0.090 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:41.268610 | finish at 2025-09-10 11:45:33 + [2025-09-09 18:48:58] iteration 1123/ 11920 | consumed samples: 1149952 | elapsed time per iteration (ms): 5647.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.385113E+00 | loss scale: 1.0 | grad norm: 0.090 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:10.500202 | finish at 2025-09-10 11:45:08 + [2025-09-09 18:49:04] iteration 1124/ 11920 | consumed samples: 1150976 | elapsed time per iteration (ms): 5974.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.380641E+00 | loss scale: 1.0 | grad norm: 0.078 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:57.095234 | finish at 2025-09-10 12:44:01 + [2025-09-09 18:49:09] iteration 1125/ 11920 | consumed samples: 1152000 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384269E+00 | loss 
scale: 1.0 | grad norm: 0.071 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:56.946404 | finish at 2025-09-10 11:43:06 + [2025-09-09 18:49:15] iteration 1126/ 11920 | consumed samples: 1153024 | elapsed time per iteration (ms): 5962.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.385684E+00 | loss scale: 1.0 | grad norm: 0.086 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:41.439231 | finish at 2025-09-10 12:41:57 + [2025-09-09 18:49:21] iteration 1127/ 11920 | consumed samples: 1154048 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.365036E+00 | loss scale: 1.0 | grad norm: 0.089 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:27.448743 | finish at 2025-09-10 11:42:48 + [2025-09-09 18:49:27] iteration 1128/ 11920 | consumed samples: 1155072 | elapsed time per iteration (ms): 5950.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.370572E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:15.392929 | finish at 2025-09-10 12:39:42 + [2025-09-09 18:49:33] iteration 1129/ 11920 | consumed samples: 1156096 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.383549E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:50.956999 | finish at 2025-09-10 11:43:23 + [2025-09-09 18:49:39] iteration 1130/ 11920 | consumed samples: 1157120 | elapsed time 
per iteration (ms): 5972.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.372468E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:54:01.855886 | finish at 2025-09-10 12:43:40 + [2025-09-09 18:49:44] iteration 1131/ 11920 | consumed samples: 1158144 | elapsed time per iteration (ms): 5655.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.396226E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:56.790094 | finish at 2025-09-10 11:46:41 + [2025-09-09 18:49:50] iteration 1132/ 11920 | consumed samples: 1159168 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384922E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:02.940943 | finish at 2025-09-10 11:44:53 + [2025-09-09 18:49:55] iteration 1133/ 11920 | consumed samples: 1160192 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.395177E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:54:20.654772 | finish at 2025-09-10 11:44:16 + [2025-09-09 18:50:01] iteration 1134/ 11920 | consumed samples: 1161216 | elapsed time per iteration (ms): 5661.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.366935E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:57:48.574971 | finish at 2025-09-10 11:47:50 + [2025-09-09 18:50:07] iteration 1135/ 11920 | consumed samples: 1162240 | elapsed time per iteration (ms): 5651.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.382661E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:49.635129 | finish at 2025-09-10 11:45:56 + [2025-09-09 18:50:13] iteration 1136/ 11920 | consumed samples: 1163264 | elapsed time per iteration (ms): 5887.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.383769E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:08.052002 | finish at 2025-09-10 12:28:21 + [2025-09-09 18:50:18] iteration 1137/ 11920 | consumed samples: 1164288 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.358520E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:03.324574 | finish at 2025-09-10 11:43:22 + [2025-09-09 18:50:24] iteration 1138/ 11920 | consumed samples: 1165312 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384867E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:06.913603 | finish at 2025-09-10 11:43:31 + [2025-09-09 18:50:30] iteration 1139/ 11920 | consumed samples: 1166336 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.364868E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:46.828253 | finish at 2025-09-10 11:44:16 + [2025-09-09 18:50:35] iteration 1140/ 11920 | consumed samples: 1167360 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.355439E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:24.839840 | finish at 2025-09-10 11:43:00 + [2025-09-09 18:50:41] iteration 1141/ 11920 | consumed samples: 1168384 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.354095E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:59.878909 | finish at 2025-09-10 11:43:41 + [2025-09-09 18:50:47] iteration 1142/ 11920 | consumed samples: 1169408 | elapsed time per iteration (ms): 5651.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.375500E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:12.403918 | finish at 2025-09-10 11:45:59 + [2025-09-09 18:50:52] iteration 1143/ 11920 | consumed samples: 1170432 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.367351E+00 | loss scale: 1.0 | grad norm: 0.108 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:22.435795 | finish at 2025-09-10 11:44:15 + [2025-09-09 
18:50:58] iteration 1144/ 11920 | consumed samples: 1171456 | elapsed time per iteration (ms): 5989.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.361320E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:39.070633 | finish at 2025-09-10 12:46:37 + [2025-09-09 18:51:04] iteration 1145/ 11920 | consumed samples: 1172480 | elapsed time per iteration (ms): 5645.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359143E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:52.936192 | finish at 2025-09-10 11:44:57 + [2025-09-09 18:51:09] iteration 1146/ 11920 | consumed samples: 1173504 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.357736E+00 | loss scale: 1.0 | grad norm: 0.094 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:22.810322 | finish at 2025-09-10 11:43:32 + [2025-09-09 18:51:15] iteration 1147/ 11920 | consumed samples: 1174528 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.349018E+00 | loss scale: 1.0 | grad norm: 0.102 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:13.340238 | finish at 2025-09-10 11:43:28 + [2025-09-09 18:51:21] iteration 1148/ 11920 | consumed samples: 1175552 | elapsed time per iteration (ms): 5938.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.351322E+00 | loss scale: 1.0 | grad norm: 
0.106 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:46:08.953637 | finish at 2025-09-10 12:37:30 + [2025-09-09 18:51:27] iteration 1149/ 11920 | consumed samples: 1176576 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.361895E+00 | loss scale: 1.0 | grad norm: 0.095 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:14.070565 | finish at 2025-09-10 11:43:41 + [2025-09-09 18:51:32] iteration 1150/ 11920 | consumed samples: 1177600 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.340824E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:12.465863 | finish at 2025-09-10 11:43:45 + [2025-09-09 18:51:38] iteration 1151/ 11920 | consumed samples: 1178624 | elapsed time per iteration (ms): 5847.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.361840E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:34.909594 | finish at 2025-09-10 12:21:13 + [2025-09-09 18:51:44] iteration 1152/ 11920 | consumed samples: 1179648 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.338923E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:40.970364 | finish at 2025-09-10 11:43:25 + [2025-09-09 18:51:49] iteration 1153/ 11920 | consumed samples: 1180672 | elapsed time per iteration (ms): 
5641.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359580E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:21.170496 | finish at 2025-09-10 11:44:11 + [2025-09-09 18:51:55] iteration 1154/ 11920 | consumed samples: 1181696 | elapsed time per iteration (ms): 5643.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.358507E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:35.398787 | finish at 2025-09-10 11:44:30 + [2025-09-09 18:52:01] iteration 1155/ 11920 | consumed samples: 1182720 | elapsed time per iteration (ms): 5668.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.369012E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:01.502022 | finish at 2025-09-10 11:49:02 + [2025-09-09 18:52:07] iteration 1156/ 11920 | consumed samples: 1183744 | elapsed time per iteration (ms): 5882.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.372066E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:17.196742 | finish at 2025-09-10 12:27:24 + [2025-09-09 18:52:12] iteration 1157/ 11920 | consumed samples: 1184768 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.378125E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:52:32.035956 | finish at 2025-09-10 11:44:44 + [2025-09-09 18:52:18] iteration 1158/ 11920 | consumed samples: 1185792 | elapsed time per iteration (ms): 5647.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.364582E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:54.808337 | finish at 2025-09-10 11:45:13 + [2025-09-09 18:52:24] iteration 1159/ 11920 | consumed samples: 1186816 | elapsed time per iteration (ms): 5876.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.360834E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:56.961452 | finish at 2025-09-10 12:26:21 + [2025-09-09 18:52:29] iteration 1160/ 11920 | consumed samples: 1187840 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359138E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:04.177208 | finish at 2025-09-10 11:43:34 + [2025-09-09 18:52:35] iteration 1161/ 11920 | consumed samples: 1188864 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.357331E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:47.545060 | finish at 2025-09-10 11:43:23 + [2025-09-09 18:52:41] iteration 1162/ 11920 | consumed samples: 1189888 | elapsed time per iteration (ms): 5647.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.360863E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:37.536723 | finish at 2025-09-10 11:45:18 + [2025-09-09 18:52:46] iteration 1163/ 11920 | consumed samples: 1190912 | elapsed time per iteration (ms): 5643.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.364090E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:46.494426 | finish at 2025-09-10 11:44:33 + [2025-09-09 18:52:52] iteration 1164/ 11920 | consumed samples: 1191936 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.351078E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:15.193861 | finish at 2025-09-10 11:44:07 + [2025-09-09 18:52:58] iteration 1165/ 11920 | consumed samples: 1192960 | elapsed time per iteration (ms): 5644.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.357240E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:42.515491 | finish at 2025-09-10 11:44:40 + [2025-09-09 18:53:03] iteration 1166/ 11920 | consumed samples: 1193984 | elapsed time per iteration (ms): 5652.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.357254E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:06.763577 | finish at 2025-09-10 11:46:10 + [2025-09-09 18:53:09] iteration 
1167/ 11920 | consumed samples: 1195008 | elapsed time per iteration (ms): 5988.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359860E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:13.903376 | finish at 2025-09-10 12:46:23 + [2025-09-09 18:53:15] iteration 1168/ 11920 | consumed samples: 1196032 | elapsed time per iteration (ms): 5645.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.366533E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:35.283325 | finish at 2025-09-10 11:44:50 + [2025-09-09 18:53:21] iteration 1169/ 11920 | consumed samples: 1197056 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.377571E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:10.311208 | finish at 2025-09-10 11:43:31 + [2025-09-09 18:53:26] iteration 1170/ 11920 | consumed samples: 1198080 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.380608E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:21.676505 | finish at 2025-09-10 11:43:48 + [2025-09-09 18:53:32] iteration 1171/ 11920 | consumed samples: 1199104 | elapsed time per iteration (ms): 6060.3 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.382135E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:05:42.227408 | finish at 2025-09-10 12:59:14 + [2025-09-09 18:53:38] iteration 1172/ 11920 | consumed samples: 1200128 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.368544E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:49:22.248248 | finish at 2025-09-10 11:43:00 + [2025-09-09 18:53:44] iteration 1173/ 11920 | consumed samples: 1201152 | elapsed time per iteration (ms): 5860.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.382543E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:29:47.286230 | finish at 2025-09-10 12:23:31 + [2025-09-09 18:53:49] iteration 1174/ 11920 | consumed samples: 1202176 | elapsed time per iteration (ms): 5642.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.355780E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:38.616104 | finish at 2025-09-10 11:44:28 + [2025-09-09 18:53:55] iteration 1175/ 11920 | consumed samples: 1203200 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.349622E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:49:55.117371 | finish at 2025-09-10 11:43:50 + [2025-09-09 18:54:01] iteration 1176/ 11920 | consumed samples: 1204224 | elapsed time per iteration (ms): 5657.1 | throughput per 
GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.369245E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:00.263668 | finish at 2025-09-10 11:47:01 + [2025-09-09 18:54:07] iteration 1177/ 11920 | consumed samples: 1205248 | elapsed time per iteration (ms): 6034.3 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359765E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:26.170811 | finish at 2025-09-10 12:54:33 + [2025-09-09 18:54:12] iteration 1178/ 11920 | consumed samples: 1206272 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.351533E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:15.931828 | finish at 2025-09-10 11:42:28 + [2025-09-09 18:54:18] iteration 1179/ 11920 | consumed samples: 1207296 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.355700E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:17.706098 | finish at 2025-09-10 11:42:36 + [2025-09-09 18:54:24] iteration 1180/ 11920 | consumed samples: 1208320 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.350636E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:02.883639 
| finish at 2025-09-10 11:42:26 + [2025-09-09 18:54:29] iteration 1181/ 11920 | consumed samples: 1209344 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.350923E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:06.840821 | finish at 2025-09-10 11:41:36 + [2025-09-09 18:54:35] iteration 1182/ 11920 | consumed samples: 1210368 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.347658E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:10.378669 | finish at 2025-09-10 11:42:45 + [2025-09-09 18:54:41] iteration 1183/ 11920 | consumed samples: 1211392 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.360059E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:12.025725 | finish at 2025-09-10 11:42:53 + [2025-09-09 18:54:46] iteration 1184/ 11920 | consumed samples: 1212416 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.337404E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:11.470116 | finish at 2025-09-10 11:42:58 + [2025-09-09 18:54:52] iteration 1185/ 11920 | consumed samples: 1213440 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.352172E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:30.572492 | finish at 2025-09-10 11:43:22 + [2025-09-09 18:54:57] iteration 1186/ 11920 | consumed samples: 1214464 | elapsed time per iteration (ms): 5640.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.360149E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:49:05.196835 | finish at 2025-09-10 11:44:03 + [2025-09-09 18:55:03] iteration 1187/ 11920 | consumed samples: 1215488 | elapsed time per iteration (ms): 5647.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.329308E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:10.886965 | finish at 2025-09-10 11:45:14 + [2025-09-09 18:55:09] iteration 1188/ 11920 | consumed samples: 1216512 | elapsed time per iteration (ms): 5956.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.350675E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:26.143676 | finish at 2025-09-10 12:40:35 + [2025-09-09 18:55:15] iteration 1189/ 11920 | consumed samples: 1217536 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.343959E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:10.566030 | finish at 2025-09-10 11:43:25 + [2025-09-09 18:55:20] iteration 1190/ 11920 | consumed samples: 
1218560 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.328565E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:03.949234 | finish at 2025-09-10 11:43:24 + [2025-09-09 18:55:26] iteration 1191/ 11920 | consumed samples: 1219584 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.327993E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:29.691233 | finish at 2025-09-10 11:43:56 + [2025-09-09 18:55:32] iteration 1192/ 11920 | consumed samples: 1220608 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.324647E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:10.362499 | finish at 2025-09-10 11:42:42 +(min, max) time across ranks (ms): + save-checkpoint ................................: (5611.28, 5611.37) + [2025-09-09 18:55:43] iteration 1193/ 11920 | consumed samples: 1221632 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.337554E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:02.129221 | finish at 2025-09-10 11:41:45 + [2025-09-09 18:55:48] iteration 1194/ 11920 | consumed samples: 1222656 | elapsed time per iteration (ms): 5638.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | 
lm loss: 3.337955E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:59.952122 | finish at 2025-09-10 11:43:48 + [2025-09-09 18:55:54] iteration 1195/ 11920 | consumed samples: 1223680 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.327679E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:05.377800 | finish at 2025-09-10 11:43:59 + [2025-09-09 18:56:00] iteration 1196/ 11920 | consumed samples: 1224704 | elapsed time per iteration (ms): 5999.9 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.335138E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:52:22.995177 | finish at 2025-09-10 12:48:23 + [2025-09-09 18:56:06] iteration 1197/ 11920 | consumed samples: 1225728 | elapsed time per iteration (ms): 5648.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.338720E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:49:30.930834 | finish at 2025-09-10 11:45:37 + [2025-09-09 18:56:12] iteration 1198/ 11920 | consumed samples: 1226752 | elapsed time per iteration (ms): 6040.2 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.336371E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:22.901643 | finish at 2025-09-10 12:55:35 + [2025-09-09 18:56:17] iteration 1199/ 11920 | consumed 
samples: 1227776 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.332947E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:05.603548 | finish at 2025-09-10 11:42:23 + [2025-09-09 18:56:23] iteration 1200/ 11920 | consumed samples: 1228800 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359000E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:29.475098 | finish at 2025-09-10 11:42:53 + [2025-09-09 18:56:29] iteration 1201/ 11920 | consumed samples: 1229824 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.333671E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:35.870419 | finish at 2025-09-10 11:42:05 + [2025-09-09 18:56:35] iteration 1202/ 11920 | consumed samples: 1230848 | elapsed time per iteration (ms): 5884.6 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.332246E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:11.397523 | finish at 2025-09-10 12:27:46 + [2025-09-09 18:56:40] iteration 1203/ 11920 | consumed samples: 1231872 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.354012E+00 | loss scale: 1.0 | grad norm: 0.489 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:27.193360 | finish at 2025-09-10 11:42:07 + [2025-09-09 18:56:46] iteration 1204/ 11920 | consumed samples: 1232896 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.345660E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:54.733549 | finish at 2025-09-10 11:43:41 + [2025-09-09 18:56:51] iteration 1205/ 11920 | consumed samples: 1233920 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.344985E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:20.956217 | finish at 2025-09-10 11:43:12 + [2025-09-09 18:56:57] iteration 1206/ 11920 | consumed samples: 1234944 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.354128E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:46.857172 | finish at 2025-09-10 11:42:44 + [2025-09-09 18:57:03] iteration 1207/ 11920 | consumed samples: 1235968 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.338154E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:24.341932 | finish at 2025-09-10 11:44:27 + [2025-09-09 18:57:08] iteration 1208/ 11920 | consumed samples: 1236992 | elapsed time per iteration (ms): 5657.0 | throughput per GPU (TFLOP/s/GPU): 79.8 
| MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.378045E+00 | loss scale: 1.0 | grad norm: 0.739 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:49:58.124369 | finish at 2025-09-10 11:47:07 + [2025-09-09 18:57:14] iteration 1209/ 11920 | consumed samples: 1238016 | elapsed time per iteration (ms): 5772.3 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.597428E+00 | loss scale: 1.0 | grad norm: 4.434 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:26.689100 | finish at 2025-09-10 12:07:41 + [2025-09-09 18:57:20] iteration 1210/ 11920 | consumed samples: 1239040 | elapsed time per iteration (ms): 5666.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.537354E+00 | loss scale: 1.0 | grad norm: 0.693 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:24.577296 | finish at 2025-09-10 11:48:44 + [2025-09-09 18:57:26] iteration 1211/ 11920 | consumed samples: 1240064 | elapsed time per iteration (ms): 5677.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.663990E+00 | loss scale: 1.0 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:19.670998 | finish at 2025-09-10 11:50:45 + [2025-09-09 18:57:31] iteration 1212/ 11920 | consumed samples: 1241088 | elapsed time per iteration (ms): 5700.3 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.767792E+00 | loss scale: 1.0 | grad norm: 1.476 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:18.687071 | finish at 2025-09-10 
11:54:50 + [2025-09-09 18:57:37] iteration 1213/ 11920 | consumed samples: 1242112 | elapsed time per iteration (ms): 5706.9 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.720754E+00 | loss scale: 1.0 | grad norm: 0.826 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:23.371143 | finish at 2025-09-10 11:56:00 + [2025-09-09 18:57:43] iteration 1214/ 11920 | consumed samples: 1243136 | elapsed time per iteration (ms): 5682.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.668565E+00 | loss scale: 1.0 | grad norm: 0.796 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:52.994591 | finish at 2025-09-10 11:51:36 + [2025-09-09 18:57:48] iteration 1215/ 11920 | consumed samples: 1244160 | elapsed time per iteration (ms): 5702.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.626123E+00 | loss scale: 1.0 | grad norm: 0.548 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:27.203349 | finish at 2025-09-10 11:55:15 + [2025-09-09 18:57:54] iteration 1216/ 11920 | consumed samples: 1245184 | elapsed time per iteration (ms): 5693.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.908776E+00 | loss scale: 1.0 | grad norm: 1.528 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:43.747616 | finish at 2025-09-10 11:53:38 + [2025-09-09 18:58:00] iteration 1217/ 11920 | consumed samples: 1246208 | elapsed time per iteration (ms): 6040.1 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.832045E+00 | loss 
scale: 1.0 | grad norm: 0.950 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:27.056100 | finish at 2025-09-10 12:55:27 + [2025-09-09 18:58:06] iteration 1218/ 11920 | consumed samples: 1247232 | elapsed time per iteration (ms): 6051.0 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.683694E+00 | loss scale: 1.0 | grad norm: 0.418 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:59:18.050766 | finish at 2025-09-10 12:57:24 + [2025-09-09 18:58:12] iteration 1219/ 11920 | consumed samples: 1248256 | elapsed time per iteration (ms): 5703.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.696416E+00 | loss scale: 1.0 | grad norm: 0.494 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:12.054229 | finish at 2025-09-10 11:55:24 + [2025-09-09 18:58:18] iteration 1220/ 11920 | consumed samples: 1249280 | elapsed time per iteration (ms): 5937.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.693454E+00 | loss scale: 1.0 | grad norm: 0.495 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:47.658081 | finish at 2025-09-10 12:37:05 + [2025-09-09 18:58:23] iteration 1221/ 11920 | consumed samples: 1250304 | elapsed time per iteration (ms): 5678.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.665764E+00 | loss scale: 1.0 | grad norm: 0.537 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:38.778162 | finish at 2025-09-10 11:51:02 + [2025-09-09 18:58:29] iteration 1222/ 11920 | consumed samples: 1251328 | elapsed time 
per iteration (ms): 5918.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.647501E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:20.695860 | finish at 2025-09-10 12:33:50 + [2025-09-09 18:58:35] iteration 1223/ 11920 | consumed samples: 1252352 | elapsed time per iteration (ms): 5683.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.618995E+00 | loss scale: 1.0 | grad norm: 0.339 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:18.264367 | finish at 2025-09-10 11:51:53 + [2025-09-09 18:58:41] iteration 1224/ 11920 | consumed samples: 1253376 | elapsed time per iteration (ms): 5690.1 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.612108E+00 | loss scale: 1.0 | grad norm: 0.327 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:54:20.975048 | finish at 2025-09-10 11:53:02 + [2025-09-09 18:58:47] iteration 1225/ 11920 | consumed samples: 1254400 | elapsed time per iteration (ms): 5898.5 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.610351E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:24.181745 | finish at 2025-09-10 12:30:11 + [2025-09-09 18:58:52] iteration 1226/ 11920 | consumed samples: 1255424 | elapsed time per iteration (ms): 5670.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.598796E+00 | loss scale: 1.0 | grad norm: 0.385 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:50:36.115409 | finish at 2025-09-10 11:49:28 + [2025-09-09 18:58:58] iteration 1227/ 11920 | consumed samples: 1256448 | elapsed time per iteration (ms): 5928.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.558266E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:36:33.725163 | finish at 2025-09-10 12:35:32 + [2025-09-09 18:59:04] iteration 1228/ 11920 | consumed samples: 1257472 | elapsed time per iteration (ms): 6021.6 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.580907E+00 | loss scale: 1.0 | grad norm: 0.468 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:53:03.171613 | finish at 2025-09-10 12:52:07 + [2025-09-09 18:59:10] iteration 1229/ 11920 | consumed samples: 1258496 | elapsed time per iteration (ms): 5699.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.616798E+00 | loss scale: 1.0 | grad norm: 0.583 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:34.158311 | finish at 2025-09-10 11:54:44 + [2025-09-09 18:59:16] iteration 1230/ 11920 | consumed samples: 1259520 | elapsed time per iteration (ms): 5681.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.589150E+00 | loss scale: 1.0 | grad norm: 0.446 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:13.636520 | finish at 2025-09-10 11:51:29 + [2025-09-09 18:59:21] iteration 1231/ 11920 | consumed samples: 1260544 | elapsed time per iteration (ms): 5680.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.527184E+00 | loss scale: 1.0 | grad norm: 0.338 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:00.184927 | finish at 2025-09-10 11:51:21 + [2025-09-09 18:59:27] iteration 1232/ 11920 | consumed samples: 1261568 | elapsed time per iteration (ms): 6056.5 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.549233E+00 | loss scale: 1.0 | grad norm: 0.366 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:58:52.044830 | finish at 2025-09-10 12:58:19 + [2025-09-09 18:59:33] iteration 1233/ 11920 | consumed samples: 1262592 | elapsed time per iteration (ms): 5726.7 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.550871E+00 | loss scale: 1.0 | grad norm: 0.537 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:01.419432 | finish at 2025-09-10 11:59:34 + [2025-09-09 18:59:39] iteration 1234/ 11920 | consumed samples: 1263616 | elapsed time per iteration (ms): 5684.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.643882E+00 | loss scale: 1.0 | grad norm: 0.749 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:52:27.234261 | finish at 2025-09-10 11:52:06 + [2025-09-09 18:59:45] iteration 1235/ 11920 | consumed samples: 1264640 | elapsed time per iteration (ms): 6017.5 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.541837E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:51:36.920450 | finish at 2025-09-10 12:51:22 + [2025-09-09 
18:59:50] iteration 1236/ 11920 | consumed samples: 1265664 | elapsed time per iteration (ms): 5671.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.531966E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:49:48.826488 | finish at 2025-09-10 11:49:39 + [2025-09-09 18:59:56] iteration 1237/ 11920 | consumed samples: 1266688 | elapsed time per iteration (ms): 5666.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.529854E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:51.799823 | finish at 2025-09-10 11:48:48 + [2025-09-09 19:00:02] iteration 1238/ 11920 | consumed samples: 1267712 | elapsed time per iteration (ms): 5676.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.507937E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:36.409531 | finish at 2025-09-10 11:50:38 + [2025-09-09 19:00:07] iteration 1239/ 11920 | consumed samples: 1268736 | elapsed time per iteration (ms): 5667.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.497797E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:52.958285 | finish at 2025-09-10 11:49:00 + [2025-09-09 19:00:13] iteration 1240/ 11920 | consumed samples: 1269760 | elapsed time per iteration (ms): 5654.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.475790E+00 | loss scale: 1.0 | grad norm: 
0.119 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:31.776295 | finish at 2025-09-10 11:46:45 + [2025-09-09 19:00:19] iteration 1241/ 11920 | consumed samples: 1270784 | elapsed time per iteration (ms): 5658.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.465670E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:08.057986 | finish at 2025-09-10 11:47:27 + [2025-09-09 19:00:24] iteration 1242/ 11920 | consumed samples: 1271808 | elapsed time per iteration (ms): 5645.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.451958E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:45.632126 | finish at 2025-09-10 11:45:10 + [2025-09-09 19:00:30] iteration 1243/ 11920 | consumed samples: 1272832 | elapsed time per iteration (ms): 5644.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.446359E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:25.542640 | finish at 2025-09-10 11:44:56 + [2025-09-09 19:00:36] iteration 1244/ 11920 | consumed samples: 1273856 | elapsed time per iteration (ms): 5647.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.420733E+00 | loss scale: 1.0 | grad norm: 0.093 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:53.461287 | finish at 2025-09-10 11:45:29 + [2025-09-09 19:00:41] iteration 1245/ 11920 | consumed samples: 1274880 | elapsed time per iteration (ms): 
5647.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.446603E+00 | loss scale: 1.0 | grad norm: 0.092 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:44.993726 | finish at 2025-09-10 11:45:26 + [2025-09-09 19:00:47] iteration 1246/ 11920 | consumed samples: 1275904 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.427505E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:13.711211 | finish at 2025-09-10 11:44:01 + [2025-09-09 19:00:53] iteration 1247/ 11920 | consumed samples: 1276928 | elapsed time per iteration (ms): 5645.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.429206E+00 | loss scale: 1.0 | grad norm: 0.083 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:12.102741 | finish at 2025-09-10 11:45:05 + [2025-09-09 19:00:58] iteration 1248/ 11920 | consumed samples: 1277952 | elapsed time per iteration (ms): 5653.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418259E+00 | loss scale: 1.0 | grad norm: 0.079 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:36.992409 | finish at 2025-09-10 11:46:35 + [2025-09-09 19:01:04] iteration 1249/ 11920 | consumed samples: 1278976 | elapsed time per iteration (ms): 5657.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.408349E+00 | loss scale: 1.0 | grad norm: 0.081 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:46:13.070575 | finish at 2025-09-10 11:47:17 + [2025-09-09 19:01:10] iteration 1250/ 11920 | consumed samples: 1280000 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.396600E+00 | loss scale: 1.0 | grad norm: 0.076 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:58.020134 | finish at 2025-09-10 11:44:08 + [2025-09-09 19:01:15] iteration 1251/ 11920 | consumed samples: 1281024 | elapsed time per iteration (ms): 5646.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384977E+00 | loss scale: 1.0 | grad norm: 0.089 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:58.444867 | finish at 2025-09-10 11:45:14 + [2025-09-09 19:01:21] iteration 1252/ 11920 | consumed samples: 1282048 | elapsed time per iteration (ms): 5643.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.410594E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:23.218431 | finish at 2025-09-10 11:44:44 + [2025-09-09 19:01:27] iteration 1253/ 11920 | consumed samples: 1283072 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.399389E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:00.525399 | finish at 2025-09-10 11:44:27 + [2025-09-09 19:01:32] iteration 1254/ 11920 | consumed samples: 1284096 | elapsed time per iteration (ms): 5645.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.398408E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:38.040438 | finish at 2025-09-10 11:45:10 + [2025-09-09 19:01:38] iteration 1255/ 11920 | consumed samples: 1285120 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.382035E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:13.483433 | finish at 2025-09-10 11:43:51 + [2025-09-09 19:01:43] iteration 1256/ 11920 | consumed samples: 1286144 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.380692E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:03.644835 | finish at 2025-09-10 11:43:47 + [2025-09-09 19:01:49] iteration 1257/ 11920 | consumed samples: 1287168 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.374571E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:49.586877 | finish at 2025-09-10 11:43:39 + [2025-09-09 19:01:55] iteration 1258/ 11920 | consumed samples: 1288192 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.360134E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:00.327893 | finish at 2025-09-10 11:43:55 + [2025-09-09 19:02:00] iteration 
1259/ 11920 | consumed samples: 1289216 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.367324E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:51.008647 | finish at 2025-09-10 11:43:51 + [2025-09-09 19:02:06] iteration 1260/ 11920 | consumed samples: 1290240 | elapsed time per iteration (ms): 5650.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.351934E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:52.935324 | finish at 2025-09-10 11:45:59 + [2025-09-09 19:02:12] iteration 1261/ 11920 | consumed samples: 1291264 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.378743E+00 | loss scale: 1.0 | grad norm: 0.396 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:54.440908 | finish at 2025-09-10 11:44:06 + [2025-09-09 19:02:17] iteration 1262/ 11920 | consumed samples: 1292288 | elapsed time per iteration (ms): 5652.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.396882E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:44:00.461338 | finish at 2025-09-10 11:46:18 + [2025-09-09 19:02:23] iteration 1263/ 11920 | consumed samples: 1293312 | elapsed time per iteration (ms): 5643.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.377765E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:22.399331 | finish at 2025-09-10 11:44:45 + [2025-09-09 19:02:29] iteration 1264/ 11920 | consumed samples: 1294336 | elapsed time per iteration (ms): 5997.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.369365E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:45:13.998505 | finish at 2025-09-10 12:47:43 + [2025-09-09 19:02:35] iteration 1265/ 11920 | consumed samples: 1295360 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.381940E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:40:12.097007 | finish at 2025-09-10 11:42:47 + [2025-09-09 19:02:40] iteration 1266/ 11920 | consumed samples: 1296384 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.359829E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:02.989764 | finish at 2025-09-10 11:45:43 + [2025-09-09 19:02:46] iteration 1267/ 11920 | consumed samples: 1297408 | elapsed time per iteration (ms): 5857.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.365776E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:20:02.188481 | finish at 2025-09-10 12:22:48 + [2025-09-09 19:02:52] iteration 1268/ 11920 | consumed samples: 1298432 | elapsed time per iteration (ms): 6122.3 | throughput per 
GPU (TFLOP/s/GPU): 73.7 | MFU 7.46% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.369176E+00 | loss scale: 1.0 | grad norm: 0.388 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:06:54.344994 | finish at 2025-09-10 13:09:47 + [2025-09-09 19:02:58] iteration 1269/ 11920 | consumed samples: 1299456 | elapsed time per iteration (ms): 5935.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.413869E+00 | loss scale: 1.0 | grad norm: 0.465 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:33:42.483399 | finish at 2025-09-10 12:36:41 + [2025-09-09 19:03:04] iteration 1270/ 11920 | consumed samples: 1300480 | elapsed time per iteration (ms): 5926.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.375433E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:57.289352 | finish at 2025-09-10 12:35:01 + [2025-09-09 19:03:10] iteration 1271/ 11920 | consumed samples: 1301504 | elapsed time per iteration (ms): 6030.7 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.404155E+00 | loss scale: 1.0 | grad norm: 0.400 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:21.358118 | finish at 2025-09-10 12:53:31 + [2025-09-09 19:03:16] iteration 1272/ 11920 | consumed samples: 1302528 | elapsed time per iteration (ms): 6088.6 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.405447E+00 | loss scale: 1.0 | grad norm: 0.350 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:00:30.998146 
| finish at 2025-09-10 13:03:47 + [2025-09-09 19:03:22] iteration 1273/ 11920 | consumed samples: 1303552 | elapsed time per iteration (ms): 5671.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.450905E+00 | loss scale: 1.0 | grad norm: 0.682 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:22.340555 | finish at 2025-09-10 11:49:44 + [2025-09-09 19:03:28] iteration 1274/ 11920 | consumed samples: 1304576 | elapsed time per iteration (ms): 5859.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.411161E+00 | loss scale: 1.0 | grad norm: 0.416 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:38.731114 | finish at 2025-09-10 12:23:06 + [2025-09-09 19:03:33] iteration 1275/ 11920 | consumed samples: 1305600 | elapsed time per iteration (ms): 5640.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.423111E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:40:42.331386 | finish at 2025-09-10 11:44:16 + [2025-09-09 19:03:39] iteration 1276/ 11920 | consumed samples: 1306624 | elapsed time per iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418669E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:39.735209 | finish at 2025-09-10 11:46:19 + [2025-09-09 19:03:45] iteration 1277/ 11920 | consumed samples: 1307648 | elapsed time per iteration (ms): 5642.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.387116E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:40:57.087708 | finish at 2025-09-10 11:44:42 + [2025-09-09 19:03:50] iteration 1278/ 11920 | consumed samples: 1308672 | elapsed time per iteration (ms): 5654.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.393565E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:57.627379 | finish at 2025-09-10 11:46:48 + [2025-09-09 19:03:56] iteration 1279/ 11920 | consumed samples: 1309696 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.390678E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:13.277805 | finish at 2025-09-10 11:45:09 + [2025-09-09 19:04:02] iteration 1280/ 11920 | consumed samples: 1310720 | elapsed time per iteration (ms): 5646.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.375127E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:22.997589 | finish at 2025-09-10 11:45:25 + [2025-09-09 19:04:07] iteration 1281/ 11920 | consumed samples: 1311744 | elapsed time per iteration (ms): 5647.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.386831E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:19.209971 | finish at 2025-09-10 11:45:26 + [2025-09-09 19:04:13] iteration 1282/ 11920 | consumed samples: 
1312768 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.371129E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:38:47.492490 | finish at 2025-09-10 11:43:00 + [2025-09-09 19:04:19] iteration 1283/ 11920 | consumed samples: 1313792 | elapsed time per iteration (ms): 5646.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.375902E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:40:59.067517 | finish at 2025-09-10 11:45:18 + [2025-09-09 19:04:24] iteration 1284/ 11920 | consumed samples: 1314816 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.374510E+00 | loss scale: 1.0 | grad norm: 0.102 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:38:52.054395 | finish at 2025-09-10 11:43:16 + [2025-09-09 19:04:30] iteration 1285/ 11920 | consumed samples: 1315840 | elapsed time per iteration (ms): 5967.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.352422E+00 | loss scale: 1.0 | grad norm: 0.082 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:37:48.099643 | finish at 2025-09-10 12:42:18 + [2025-09-09 19:04:36] iteration 1286/ 11920 | consumed samples: 1316864 | elapsed time per iteration (ms): 5646.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.361837E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:40:41.294666 | finish at 2025-09-10 11:45:17 + [2025-09-09 19:04:41] iteration 1287/ 11920 | consumed samples: 1317888 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.345175E+00 | loss scale: 1.0 | grad norm: 0.073 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:39:40.319842 | finish at 2025-09-10 11:44:22 + [2025-09-09 19:04:47] iteration 1288/ 11920 | consumed samples: 1318912 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.353897E+00 | loss scale: 1.0 | grad norm: 0.080 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:39:53.226500 | finish at 2025-09-10 11:44:40 + [2025-09-09 19:04:53] iteration 1289/ 11920 | consumed samples: 1319936 | elapsed time per iteration (ms): 5863.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.354679E+00 | loss scale: 1.0 | grad norm: 0.070 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:56.915202 | finish at 2025-09-10 12:23:50 + [2025-09-09 19:04:59] iteration 1290/ 11920 | consumed samples: 1320960 | elapsed time per iteration (ms): 6040.8 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.343899E+00 | loss scale: 1.0 | grad norm: 0.068 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:50:13.182921 | finish at 2025-09-10 12:55:12 + [2025-09-09 19:05:05] iteration 1291/ 11920 | consumed samples: 1321984 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 
| MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.352276E+00 | loss scale: 1.0 | grad norm: 0.066 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:39:31.681165 | finish at 2025-09-10 11:44:36 + [2025-09-09 19:05:10] iteration 1292/ 11920 | consumed samples: 1323008 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.347208E+00 | loss scale: 1.0 | grad norm: 0.060 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:38:52.266908 | finish at 2025-09-10 11:44:03 + [2025-09-09 19:05:17] iteration 1293/ 11920 | consumed samples: 1324032 | elapsed time per iteration (ms): 6332.0 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.336048E+00 | loss scale: 1.0 | grad norm: 0.062 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:41:30.561969 | finish at 2025-09-10 13:46:47 + [2025-09-09 19:05:22] iteration 1294/ 11920 | consumed samples: 1325056 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.340147E+00 | loss scale: 1.0 | grad norm: 0.067 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:01.946584 | finish at 2025-09-10 11:42:24 + [2025-09-09 19:05:28] iteration 1295/ 11920 | consumed samples: 1326080 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.323895E+00 | loss scale: 1.0 | grad norm: 0.065 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:03.962003 | finish at 2025-09-10 
11:42:32 + [2025-09-09 19:05:33] iteration 1296/ 11920 | consumed samples: 1327104 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.330188E+00 | loss scale: 1.0 | grad norm: 0.076 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:34.636414 | finish at 2025-09-10 11:43:08 + [2025-09-09 19:05:39] iteration 1297/ 11920 | consumed samples: 1328128 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.328449E+00 | loss scale: 1.0 | grad norm: 0.067 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:55.732538 | finish at 2025-09-10 11:41:35 + [2025-09-09 19:05:45] iteration 1298/ 11920 | consumed samples: 1329152 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.335424E+00 | loss scale: 1.0 | grad norm: 0.072 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:36:07.505563 | finish at 2025-09-10 11:41:52 + [2025-09-09 19:05:50] iteration 1299/ 11920 | consumed samples: 1330176 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.326106E+00 | loss scale: 1.0 | grad norm: 0.071 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:36:31.594677 | finish at 2025-09-10 11:42:22 + [2025-09-09 19:05:56] iteration 1300/ 11920 | consumed samples: 1331200 | elapsed time per iteration (ms): 5849.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.334838E+00 | loss 
scale: 1.0 | grad norm: 0.071 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:19.230795 | finish at 2025-09-10 12:21:15 + [2025-09-09 19:06:02] iteration 1301/ 11920 | consumed samples: 1332224 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.334317E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:16.558497 | finish at 2025-09-10 11:43:18 + [2025-09-09 19:06:07] iteration 1302/ 11920 | consumed samples: 1333248 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.322283E+00 | loss scale: 1.0 | grad norm: 0.074 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:34.692160 | finish at 2025-09-10 11:43:42 + [2025-09-09 19:06:13] iteration 1303/ 11920 | consumed samples: 1334272 | elapsed time per iteration (ms): 5926.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.307109E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:43.580214 | finish at 2025-09-10 12:34:57 + [2025-09-09 19:06:19] iteration 1304/ 11920 | consumed samples: 1335296 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.312181E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:36:15.188408 | finish at 2025-09-10 11:42:34 + [2025-09-09 19:06:25] iteration 1305/ 11920 | consumed samples: 1336320 | elapsed time 
per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.317969E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:36:38.819001 | finish at 2025-09-10 11:43:03 + [2025-09-09 19:06:30] iteration 1306/ 11920 | consumed samples: 1337344 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.312696E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:47.306253 | finish at 2025-09-10 11:42:18 + [2025-09-09 19:06:36] iteration 1307/ 11920 | consumed samples: 1338368 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.308476E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:26.355963 | finish at 2025-09-10 11:42:02 + [2025-09-09 19:06:42] iteration 1308/ 11920 | consumed samples: 1339392 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.314567E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:36:04.863332 | finish at 2025-09-10 11:42:46 + [2025-09-09 19:06:47] iteration 1309/ 11920 | consumed samples: 1340416 | elapsed time per iteration (ms): 5874.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.322100E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 17:18:51.838092 | finish at 2025-09-10 12:25:39 + [2025-09-09 19:06:53] iteration 1310/ 11920 | consumed samples: 1341440 | elapsed time per iteration (ms): 5856.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.319264E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:15:34.856009 | finish at 2025-09-10 12:22:28 + [2025-09-09 19:06:59] iteration 1311/ 11920 | consumed samples: 1342464 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.336894E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:49.711132 | finish at 2025-09-10 11:41:49 + [2025-09-09 19:07:05] iteration 1312/ 11920 | consumed samples: 1343488 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.310186E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:20.019196 | finish at 2025-09-10 11:44:25 + [2025-09-09 19:07:10] iteration 1313/ 11920 | consumed samples: 1344512 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.315913E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:47.902232 | finish at 2025-09-10 11:42:58 + [2025-09-09 19:07:16] iteration 1314/ 11920 | consumed samples: 1345536 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.314765E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:50.894199 | finish at 2025-09-10 11:41:07 + [2025-09-09 19:07:22] iteration 1315/ 11920 | consumed samples: 1346560 | elapsed time per iteration (ms): 5868.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.302583E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:09.621996 | finish at 2025-09-10 12:24:31 + [2025-09-09 19:07:28] iteration 1316/ 11920 | consumed samples: 1347584 | elapsed time per iteration (ms): 5971.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.302731E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:35:23.051774 | finish at 2025-09-10 12:42:51 + [2025-09-09 19:07:33] iteration 1317/ 11920 | consumed samples: 1348608 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.287365E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:05.460582 | finish at 2025-09-10 11:42:39 + [2025-09-09 19:07:39] iteration 1318/ 11920 | consumed samples: 1349632 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.305178E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:45.853855 | finish at 2025-09-10 11:42:25 + [2025-09-09 
19:07:45] iteration 1319/ 11920 | consumed samples: 1350656 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.303133E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:18.601625 | finish at 2025-09-10 11:42:03 + [2025-09-09 19:07:50] iteration 1320/ 11920 | consumed samples: 1351680 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.292163E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:05.364475 | finish at 2025-09-10 11:41:56 + [2025-09-09 19:07:56] iteration 1321/ 11920 | consumed samples: 1352704 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.303386E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:26.219446 | finish at 2025-09-10 11:41:22 + [2025-09-09 19:08:01] iteration 1322/ 11920 | consumed samples: 1353728 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.303949E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:26.427449 | finish at 2025-09-10 11:41:28 + [2025-09-09 19:08:07] iteration 1323/ 11920 | consumed samples: 1354752 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.305979E+00 | loss scale: 1.0 | grad norm: 
0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:19.683565 | finish at 2025-09-10 11:43:27 + [2025-09-09 19:08:13] iteration 1324/ 11920 | consumed samples: 1355776 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.299734E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:12.311500 | finish at 2025-09-10 11:41:25 + [2025-09-09 19:08:18] iteration 1325/ 11920 | consumed samples: 1356800 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.290844E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:57.286665 | finish at 2025-09-10 11:42:16 + [2025-09-09 19:08:24] iteration 1326/ 11920 | consumed samples: 1357824 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.291498E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:31.666625 | finish at 2025-09-10 11:42:56 + [2025-09-09 19:08:30] iteration 1327/ 11920 | consumed samples: 1358848 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.304551E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:37.118914 | finish at 2025-09-10 11:41:07 + [2025-09-09 19:08:35] iteration 1328/ 11920 | consumed samples: 1359872 | elapsed time per iteration (ms): 
5941.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.304861E+00 | loss scale: 1.0 | grad norm: 0.098 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:28:49.551712 | finish at 2025-09-10 12:37:25 + [2025-09-09 19:08:41] iteration 1329/ 11920 | consumed samples: 1360896 | elapsed time per iteration (ms): 5973.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.294427E+00 | loss scale: 1.0 | grad norm: 0.084 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:29.792902 | finish at 2025-09-10 12:43:11 + [2025-09-09 19:08:47] iteration 1330/ 11920 | consumed samples: 1361920 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.288569E+00 | loss scale: 1.0 | grad norm: 0.087 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:17.859027 | finish at 2025-09-10 11:42:05 + [2025-09-09 19:08:53] iteration 1331/ 11920 | consumed samples: 1362944 | elapsed time per iteration (ms): 6351.2 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.286209E+00 | loss scale: 1.0 | grad norm: 0.077 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:40:52.772062 | finish at 2025-09-10 13:49:46 + [2025-09-09 19:08:59] iteration 1332/ 11920 | consumed samples: 1363968 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.292109E+00 | loss scale: 1.0 | grad norm: 0.081 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:31:47.913684 | finish at 2025-09-10 11:40:47 + [2025-09-09 19:09:05] iteration 1333/ 11920 | consumed samples: 1364992 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.296536E+00 | loss scale: 1.0 | grad norm: 0.075 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:07.509521 | finish at 2025-09-10 11:43:12 + [2025-09-09 19:09:10] iteration 1334/ 11920 | consumed samples: 1366016 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.289284E+00 | loss scale: 1.0 | grad norm: 0.096 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:26.821054 | finish at 2025-09-10 11:42:37 + [2025-09-09 19:09:16] iteration 1335/ 11920 | consumed samples: 1367040 | elapsed time per iteration (ms): 5930.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.289083E+00 | loss scale: 1.0 | grad norm: 0.081 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:26:11.531465 | finish at 2025-09-10 12:35:28 + [2025-09-09 19:09:22] iteration 1336/ 11920 | consumed samples: 1368064 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.297970E+00 | loss scale: 1.0 | grad norm: 0.096 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:05.347910 | finish at 2025-09-10 11:41:27 + [2025-09-09 19:09:28] iteration 1337/ 11920 | consumed samples: 1369088 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.308555E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:50.935573 | finish at 2025-09-10 11:41:18 + [2025-09-09 19:09:33] iteration 1338/ 11920 | consumed samples: 1370112 | elapsed time per iteration (ms): 5854.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.285795E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:12:30.215449 | finish at 2025-09-10 12:22:04 + [2025-09-09 19:09:39] iteration 1339/ 11920 | consumed samples: 1371136 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.309679E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:13.755755 | finish at 2025-09-10 11:42:53 + [2025-09-09 19:09:45] iteration 1340/ 11920 | consumed samples: 1372160 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.303053E+00 | loss scale: 1.0 | grad norm: 0.331 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:02.499065 | finish at 2025-09-10 11:41:47 + [2025-09-09 19:09:50] iteration 1341/ 11920 | consumed samples: 1373184 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.298801E+00 | loss scale: 1.0 | grad norm: 0.283 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:39.531599 | finish at 2025-09-10 11:42:30 + [2025-09-09 19:09:56] iteration 
1342/ 11920 | consumed samples: 1374208 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.295577E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:23.823742 | finish at 2025-09-10 11:42:20 + [2025-09-09 19:10:02] iteration 1343/ 11920 | consumed samples: 1375232 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.321685E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:57.332253 | finish at 2025-09-10 11:41:59 + [2025-09-09 19:10:07] iteration 1344/ 11920 | consumed samples: 1376256 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.292459E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:05.829620 | finish at 2025-09-10 11:44:13 + [2025-09-09 19:10:13] iteration 1345/ 11920 | consumed samples: 1377280 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.300732E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:19.381689 | finish at 2025-09-10 11:42:32 + [2025-09-09 19:10:18] iteration 1346/ 11920 | consumed samples: 1378304 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.297107E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:02.692399 | finish at 2025-09-10 11:43:21 + [2025-09-09 19:10:24] iteration 1347/ 11920 | consumed samples: 1379328 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.287350E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:06.344195 | finish at 2025-09-10 11:43:30 + [2025-09-09 19:10:30] iteration 1348/ 11920 | consumed samples: 1380352 | elapsed time per iteration (ms): 5937.9 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.294791E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:26:15.280377 | finish at 2025-09-10 12:36:45 + [2025-09-09 19:10:36] iteration 1349/ 11920 | consumed samples: 1381376 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.290626E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:56.189152 | finish at 2025-09-10 11:41:32 + [2025-09-09 19:10:41] iteration 1350/ 11920 | consumed samples: 1382400 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.278733E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:20.309246 | finish at 2025-09-10 11:42:02 + [2025-09-09 19:10:47] iteration 1351/ 11920 | consumed samples: 1383424 | elapsed time per iteration (ms): 5626.7 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.288655E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:08.672137 | finish at 2025-09-10 11:41:56 + [2025-09-09 19:10:53] iteration 1352/ 11920 | consumed samples: 1384448 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.297620E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:37.482840 | finish at 2025-09-10 11:43:30 + [2025-09-09 19:10:58] iteration 1353/ 11920 | consumed samples: 1385472 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.293185E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:31.871165 | finish at 2025-09-10 11:44:30 + [2025-09-09 19:11:04] iteration 1354/ 11920 | consumed samples: 1386496 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.285972E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:34.874843 | finish at 2025-09-10 11:43:39 + [2025-09-09 19:11:09] iteration 1355/ 11920 | consumed samples: 1387520 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.302423E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:21.104861 
| finish at 2025-09-10 11:42:31 + [2025-09-09 19:11:15] iteration 1356/ 11920 | consumed samples: 1388544 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.290461E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:29.002537 | finish at 2025-09-10 11:42:44 + [2025-09-09 19:11:21] iteration 1357/ 11920 | consumed samples: 1389568 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.288070E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:33.321501 | finish at 2025-09-10 11:42:54 + [2025-09-09 19:11:26] iteration 1358/ 11920 | consumed samples: 1390592 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.301948E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:02.187685 | finish at 2025-09-10 11:42:28 + [2025-09-09 19:11:32] iteration 1359/ 11920 | consumed samples: 1391616 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.286085E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:11.620212 | finish at 2025-09-10 11:41:44 + [2025-09-09 19:11:38] iteration 1360/ 11920 | consumed samples: 1392640 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.298718E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:47.629852 | finish at 2025-09-10 11:42:25 + [2025-09-09 19:11:43] iteration 1361/ 11920 | consumed samples: 1393664 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.296815E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:32.363498 | finish at 2025-09-10 11:42:16 + [2025-09-09 19:11:49] iteration 1362/ 11920 | consumed samples: 1394688 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.279193E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:06.283533 | finish at 2025-09-10 11:43:55 + [2025-09-09 19:11:54] iteration 1363/ 11920 | consumed samples: 1395712 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.277413E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:02.890770 | finish at 2025-09-10 11:42:57 + [2025-09-09 19:12:00] iteration 1364/ 11920 | consumed samples: 1396736 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.284104E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:54.291733 | finish at 2025-09-10 11:41:54 + [2025-09-09 19:12:06] iteration 1365/ 11920 | consumed samples: 
1397760 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.286749E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:55.090890 | finish at 2025-09-10 11:43:01 + [2025-09-09 19:12:11] iteration 1366/ 11920 | consumed samples: 1398784 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.273795E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:45.116990 | finish at 2025-09-10 11:41:56 + [2025-09-09 19:12:17] iteration 1367/ 11920 | consumed samples: 1399808 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.295027E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:02.740645 | finish at 2025-09-10 11:43:20 + [2025-09-09 19:12:23] iteration 1368/ 11920 | consumed samples: 1400832 | elapsed time per iteration (ms): 5644.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.288018E+00 | loss scale: 1.0 | grad norm: 0.326 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:43.171795 | finish at 2025-09-10 11:45:06 + [2025-09-09 19:12:28] iteration 1369/ 11920 | consumed samples: 1401856 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.305066E+00 | loss scale: 1.0 | grad norm: 0.346 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:10.005897 | finish at 2025-09-10 11:43:38 + [2025-09-09 19:12:34] iteration 1370/ 11920 | consumed samples: 1402880 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.289384E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:07.244122 | finish at 2025-09-10 11:42:41 + [2025-09-09 19:12:40] iteration 1371/ 11920 | consumed samples: 1403904 | elapsed time per iteration (ms): 5867.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.309548E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:11:38.476503 | finish at 2025-09-10 12:24:18 + [2025-09-09 19:12:45] iteration 1372/ 11920 | consumed samples: 1404928 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.295961E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:21.800972 | finish at 2025-09-10 11:44:07 + [2025-09-09 19:12:51] iteration 1373/ 11920 | consumed samples: 1405952 | elapsed time per iteration (ms): 5644.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.292553E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:16.165221 | finish at 2025-09-10 11:45:07 + [2025-09-09 19:12:57] iteration 1374/ 11920 | consumed samples: 1406976 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 
| MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.297516E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:00.585903 | finish at 2025-09-10 11:43:57 + [2025-09-09 19:13:02] iteration 1375/ 11920 | consumed samples: 1408000 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.285816E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:19.771718 | finish at 2025-09-10 11:45:22 + [2025-09-09 19:13:08] iteration 1376/ 11920 | consumed samples: 1409024 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.282555E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:30.293354 | finish at 2025-09-10 11:44:38 + [2025-09-09 19:13:14] iteration 1377/ 11920 | consumed samples: 1410048 | elapsed time per iteration (ms): 5851.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.294708E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:15.396320 | finish at 2025-09-10 12:21:29 + [2025-09-09 19:13:19] iteration 1378/ 11920 | consumed samples: 1411072 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.274234E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:32.222054 | finish at 2025-09-10 
11:43:52 + [2025-09-09 19:13:25] iteration 1379/ 11920 | consumed samples: 1412096 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.273822E+00 | loss scale: 1.0 | grad norm: 0.099 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:28:13.436633 | finish at 2025-09-10 11:41:39 + [2025-09-09 19:13:31] iteration 1380/ 11920 | consumed samples: 1413120 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.268674E+00 | loss scale: 1.0 | grad norm: 0.087 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:30.196834 | finish at 2025-09-10 11:44:01 + [2025-09-09 19:13:36] iteration 1381/ 11920 | consumed samples: 1414144 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.289484E+00 | loss scale: 1.0 | grad norm: 0.083 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:14.841710 | finish at 2025-09-10 11:43:51 + [2025-09-09 19:13:42] iteration 1382/ 11920 | consumed samples: 1415168 | elapsed time per iteration (ms): 5845.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.281581E+00 | loss scale: 1.0 | grad norm: 0.081 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:36.673027 | finish at 2025-09-10 12:20:19 + [2025-09-09 19:13:48] iteration 1383/ 11920 | consumed samples: 1416192 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.274711E+00 | loss 
scale: 1.0 | grad norm: 0.089 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:05.687519 | finish at 2025-09-10 11:42:54 + [2025-09-09 19:13:53] iteration 1384/ 11920 | consumed samples: 1417216 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.278792E+00 | loss scale: 1.0 | grad norm: 0.085 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:59.549377 | finish at 2025-09-10 11:41:53 + [2025-09-09 19:13:59] iteration 1385/ 11920 | consumed samples: 1418240 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.276319E+00 | loss scale: 1.0 | grad norm: 0.094 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:20.459965 | finish at 2025-09-10 11:43:20 + [2025-09-09 19:14:05] iteration 1386/ 11920 | consumed samples: 1419264 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.278336E+00 | loss scale: 1.0 | grad norm: 0.098 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:56.500989 | finish at 2025-09-10 11:45:01 + [2025-09-09 19:14:10] iteration 1387/ 11920 | consumed samples: 1420288 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.265399E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:00.175341 | finish at 2025-09-10 11:43:11 + [2025-09-09 19:14:16] iteration 1388/ 11920 | consumed samples: 1421312 | elapsed time 
per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.285643E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:19.247572 | finish at 2025-09-10 11:43:35 + [2025-09-09 19:14:22] iteration 1389/ 11920 | consumed samples: 1422336 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.271796E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:28:33.878298 | finish at 2025-09-10 11:42:56 + [2025-09-09 19:14:27] iteration 1390/ 11920 | consumed samples: 1423360 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.275621E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:52.018783 | finish at 2025-09-10 11:42:19 + [2025-09-09 19:14:33] iteration 1391/ 11920 | consumed samples: 1424384 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.274780E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:10.169759 | finish at 2025-09-10 11:40:43 + [2025-09-09 19:14:39] iteration 1392/ 11920 | consumed samples: 1425408 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.266846E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:26:06.276955 | finish at 2025-09-10 11:40:45 + [2025-09-09 19:14:44] iteration 1393/ 11920 | consumed samples: 1426432 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255267E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:43.683114 | finish at 2025-09-10 11:41:28 + [2025-09-09 19:14:50] iteration 1394/ 11920 | consumed samples: 1427456 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.284323E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:37.770526 | finish at 2025-09-10 11:41:28 + [2025-09-09 19:14:55] iteration 1395/ 11920 | consumed samples: 1428480 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.260334E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:25:54.543877 | finish at 2025-09-10 11:40:50 + [2025-09-09 19:15:01] iteration 1396/ 11920 | consumed samples: 1429504 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.271531E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:28:09.848056 | finish at 2025-09-10 11:43:11 + [2025-09-09 19:15:07] iteration 1397/ 11920 | consumed samples: 1430528 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.258642E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:17.639457 | finish at 2025-09-10 11:42:24 + [2025-09-09 19:15:12] iteration 1398/ 11920 | consumed samples: 1431552 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.260559E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:01.895270 | finish at 2025-09-10 11:42:14 + [2025-09-09 19:15:18] iteration 1399/ 11920 | consumed samples: 1432576 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.267677E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:21.621808 | finish at 2025-09-10 11:42:40 + [2025-09-09 19:15:24] iteration 1400/ 11920 | consumed samples: 1433600 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.259477E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:57.337799 | finish at 2025-09-10 11:42:21 + [2025-09-09 19:15:29] iteration 1401/ 11920 | consumed samples: 1434624 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.266824E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:34.386135 | finish at 2025-09-10 11:43:04 + [2025-09-09 
19:15:35] iteration 1402/ 11920 | consumed samples: 1435648 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.268732E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:41.997236 | finish at 2025-09-10 11:42:17 + [2025-09-09 19:15:40] iteration 1403/ 11920 | consumed samples: 1436672 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.279293E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:27.865843 | finish at 2025-09-10 11:42:08 + [2025-09-09 19:15:46] iteration 1404/ 11920 | consumed samples: 1437696 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.270989E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:45.675412 | finish at 2025-09-10 11:42:32 + [2025-09-09 19:15:52] iteration 1405/ 11920 | consumed samples: 1438720 | elapsed time per iteration (ms): 6019.2 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.274535E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:52.305068 | finish at 2025-09-10 12:50:44 + [2025-09-09 19:15:58] iteration 1406/ 11920 | consumed samples: 1439744 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.276893E+00 | loss scale: 1.0 | grad norm: 
0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:25:21.943143 | finish at 2025-09-10 11:41:20 + [2025-09-09 19:16:04] iteration 1407/ 11920 | consumed samples: 1440768 | elapsed time per iteration (ms): 5841.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.265514E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:31.970689 | finish at 2025-09-10 12:19:36 + [2025-09-09 19:16:09] iteration 1408/ 11920 | consumed samples: 1441792 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.267783E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:26.423801 | finish at 2025-09-10 11:40:36 + [2025-09-09 19:16:15] iteration 1409/ 11920 | consumed samples: 1442816 | elapsed time per iteration (ms): 6000.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.256718E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:31:14.751014 | finish at 2025-09-10 12:47:30 + [2025-09-09 19:16:21] iteration 1410/ 11920 | consumed samples: 1443840 | elapsed time per iteration (ms): 5931.9 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255399E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:04.747860 | finish at 2025-09-10 12:35:26 + [2025-09-09 19:16:27] iteration 1411/ 11920 | consumed samples: 1444864 | elapsed time per iteration (ms): 
5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.262361E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:27.376330 | finish at 2025-09-10 11:40:54 + [2025-09-09 19:16:32] iteration 1412/ 11920 | consumed samples: 1445888 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255735E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:32.535999 | finish at 2025-09-10 11:41:05 + [2025-09-09 19:16:38] iteration 1413/ 11920 | consumed samples: 1446912 | elapsed time per iteration (ms): 5832.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.271673E+00 | loss scale: 1.0 | grad norm: 0.108 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:19.487063 | finish at 2025-09-10 12:17:58 + [2025-09-09 19:16:44] iteration 1414/ 11920 | consumed samples: 1447936 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.268142E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:05.382002 | finish at 2025-09-10 11:40:49 + [2025-09-09 19:16:49] iteration 1415/ 11920 | consumed samples: 1448960 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.248228E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:25:02.158624 | finish at 2025-09-10 11:41:52 + [2025-09-09 19:16:55] iteration 1416/ 11920 | consumed samples: 1449984 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.257189E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:25.616341 | finish at 2025-09-10 11:41:21 + [2025-09-09 19:17:01] iteration 1417/ 11920 | consumed samples: 1451008 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.261744E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:17.920242 | finish at 2025-09-10 11:44:19 + [2025-09-09 19:17:06] iteration 1418/ 11920 | consumed samples: 1452032 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.257786E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:27:06.170699 | finish at 2025-09-10 11:44:13 + [2025-09-09 19:17:12] iteration 1419/ 11920 | consumed samples: 1453056 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.249977E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:25:51.225603 | finish at 2025-09-10 11:43:03 + [2025-09-09 19:17:18] iteration 1420/ 11920 | consumed samples: 1454080 | elapsed time per iteration (ms): 5921.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.241312E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:18.185463 | finish at 2025-09-10 12:33:36 + [2025-09-09 19:17:24] iteration 1421/ 11920 | consumed samples: 1455104 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.258308E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:14.326787 | finish at 2025-09-10 11:41:38 + [2025-09-09 19:17:29] iteration 1422/ 11920 | consumed samples: 1456128 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.256304E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:13.397504 | finish at 2025-09-10 11:41:43 + [2025-09-09 19:17:35] iteration 1423/ 11920 | consumed samples: 1457152 | elapsed time per iteration (ms): 5953.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255849E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:21:33.714653 | finish at 2025-09-10 12:39:09 + [2025-09-09 19:17:41] iteration 1424/ 11920 | consumed samples: 1458176 | elapsed time per iteration (ms): 5825.7 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.257444E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:07.025635 | finish at 2025-09-10 12:16:48 + [2025-09-09 19:17:47] iteration 
1425/ 11920 | consumed samples: 1459200 | elapsed time per iteration (ms): 5822.2 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.264633E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:24.107232 | finish at 2025-09-10 12:16:11 + [2025-09-09 19:17:53] iteration 1426/ 11920 | consumed samples: 1460224 | elapsed time per iteration (ms): 5990.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.263984E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:27:48.792742 | finish at 2025-09-10 12:45:42 + [2025-09-09 19:17:59] iteration 1427/ 11920 | consumed samples: 1461248 | elapsed time per iteration (ms): 5971.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255299E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:21.600489 | finish at 2025-09-10 12:42:20 + [2025-09-09 19:18:05] iteration 1428/ 11920 | consumed samples: 1462272 | elapsed time per iteration (ms): 5931.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.249079E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:15.601430 | finish at 2025-09-10 12:35:20 + [2025-09-09 19:18:11] iteration 1429/ 11920 | consumed samples: 1463296 | elapsed time per iteration (ms): 6148.8 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.248717E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:55:07.449906 | finish at 2025-09-10 13:13:18 + [2025-09-09 19:18:17] iteration 1430/ 11920 | consumed samples: 1464320 | elapsed time per iteration (ms): 5946.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.260850E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:40.461338 | finish at 2025-09-10 12:37:57 + [2025-09-09 19:18:23] iteration 1431/ 11920 | consumed samples: 1465344 | elapsed time per iteration (ms): 5856.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.247252E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:43.981063 | finish at 2025-09-10 12:22:07 + [2025-09-09 19:18:28] iteration 1432/ 11920 | consumed samples: 1466368 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.244077E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:21.781134 | finish at 2025-09-10 11:40:50 + [2025-09-09 19:18:34] iteration 1433/ 11920 | consumed samples: 1467392 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.259352E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:58.061162 | finish at 2025-09-10 11:41:32 + [2025-09-09 19:18:39] iteration 1434/ 11920 | consumed samples: 1468416 | elapsed time per iteration (ms): 5617.3 | throughput per 
GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.265916E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:43.098154 | finish at 2025-09-10 11:40:23 + [2025-09-09 19:18:45] iteration 1435/ 11920 | consumed samples: 1469440 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.250392E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:10.758433 | finish at 2025-09-10 11:40:56 + [2025-09-09 19:18:51] iteration 1436/ 11920 | consumed samples: 1470464 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.229545E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:06.107716 | finish at 2025-09-10 11:41:57 + [2025-09-09 19:18:56] iteration 1437/ 11920 | consumed samples: 1471488 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.248594E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:42.703598 | finish at 2025-09-10 11:41:39 + [2025-09-09 19:19:02] iteration 1438/ 11920 | consumed samples: 1472512 | elapsed time per iteration (ms): 5879.9 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.247853E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:12.779195 
| finish at 2025-09-10 12:26:15 + [2025-09-09 19:19:08] iteration 1439/ 11920 | consumed samples: 1473536 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.240004E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:12.588464 | finish at 2025-09-10 11:43:20 + [2025-09-09 19:19:13] iteration 1440/ 11920 | consumed samples: 1474560 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.244628E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:27.623329 | finish at 2025-09-10 11:42:41 + [2025-09-09 19:19:19] iteration 1441/ 11920 | consumed samples: 1475584 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.256927E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:16.586318 | finish at 2025-09-10 11:42:36 + [2025-09-09 19:19:25] iteration 1442/ 11920 | consumed samples: 1476608 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.250580E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:00.915708 | finish at 2025-09-10 11:41:26 + [2025-09-09 19:19:30] iteration 1443/ 11920 | consumed samples: 1477632 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.229519E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:37.596872 | finish at 2025-09-10 11:43:08 + [2025-09-09 19:19:36] iteration 1444/ 11920 | consumed samples: 1478656 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.237339E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:39.576015 | finish at 2025-09-10 11:41:16 + [2025-09-09 19:19:42] iteration 1445/ 11920 | consumed samples: 1479680 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.242712E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:40.464493 | finish at 2025-09-10 11:41:22 + [2025-09-09 19:19:47] iteration 1446/ 11920 | consumed samples: 1480704 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.263940E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:31.937165 | finish at 2025-09-10 11:40:19 + [2025-09-09 19:19:53] iteration 1447/ 11920 | consumed samples: 1481728 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.250123E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:10.189266 | finish at 2025-09-10 11:41:03 + [2025-09-09 19:19:58] iteration 1448/ 11920 | consumed samples: 
1482752 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.247034E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:37.096724 | finish at 2025-09-10 11:40:36 + [2025-09-09 19:20:04] iteration 1449/ 11920 | consumed samples: 1483776 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.252091E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:42.605869 | finish at 2025-09-10 11:42:47 + [2025-09-09 19:20:10] iteration 1450/ 11920 | consumed samples: 1484800 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.242606E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:55.549057 | finish at 2025-09-10 11:44:05 + [2025-09-09 19:20:15] iteration 1451/ 11920 | consumed samples: 1485824 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.252620E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:39.928603 | finish at 2025-09-10 11:41:55 + [2025-09-09 19:20:21] iteration 1452/ 11920 | consumed samples: 1486848 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.253888E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:26.493225 | finish at 2025-09-10 11:41:47 + [2025-09-09 19:20:27] iteration 1453/ 11920 | consumed samples: 1487872 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.261979E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:42.920818 | finish at 2025-09-10 11:42:10 + [2025-09-09 19:20:32] iteration 1454/ 11920 | consumed samples: 1488896 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.251510E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:16.806229 | finish at 2025-09-10 11:42:49 + [2025-09-09 19:20:38] iteration 1455/ 11920 | consumed samples: 1489920 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.249160E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:22.835858 | finish at 2025-09-10 11:42:01 + [2025-09-09 19:20:43] iteration 1456/ 11920 | consumed samples: 1490944 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.258680E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:44.739693 | finish at 2025-09-10 11:43:28 + [2025-09-09 19:20:49] iteration 1457/ 11920 | consumed samples: 1491968 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.257710E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:55.445172 | finish at 2025-09-10 11:41:45 + [2025-09-09 19:20:55] iteration 1458/ 11920 | consumed samples: 1492992 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.244118E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:39.759154 | finish at 2025-09-10 11:42:35 + [2025-09-09 19:21:01] iteration 1459/ 11920 | consumed samples: 1494016 | elapsed time per iteration (ms): 5958.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.258771E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:49.797442 | finish at 2025-09-10 12:39:51 + [2025-09-09 19:21:07] iteration 1460/ 11920 | consumed samples: 1495040 | elapsed time per iteration (ms): 5866.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.248918E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:38.222179 | finish at 2025-09-10 12:23:45 + [2025-09-09 19:21:12] iteration 1461/ 11920 | consumed samples: 1496064 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.250973E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:55.717983 | finish at 2025-09-10 
11:43:08 + [2025-09-09 19:21:18] iteration 1462/ 11920 | consumed samples: 1497088 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.251763E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:17.483271 | finish at 2025-09-10 11:41:35 + [2025-09-09 19:21:23] iteration 1463/ 11920 | consumed samples: 1498112 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.238199E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:47.742880 | finish at 2025-09-10 11:44:11 + [2025-09-09 19:21:29] iteration 1464/ 11920 | consumed samples: 1499136 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.250875E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:48.314407 | finish at 2025-09-10 11:43:17 + [2025-09-09 19:21:35] iteration 1465/ 11920 | consumed samples: 1500160 | elapsed time per iteration (ms): 5846.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.234878E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:58:48.992876 | finish at 2025-09-10 12:20:24 + [2025-09-09 19:21:41] iteration 1466/ 11920 | consumed samples: 1501184 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.232991E+00 | loss 
scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:09.378117 | finish at 2025-09-10 11:42:50 + [2025-09-09 19:21:46] iteration 1467/ 11920 | consumed samples: 1502208 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.238695E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:01.596620 | finish at 2025-09-10 11:41:48 + [2025-09-09 19:21:52] iteration 1468/ 11920 | consumed samples: 1503232 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.240178E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:19:28.542383 | finish at 2025-09-10 11:41:20 + [2025-09-09 19:21:57] iteration 1469/ 11920 | consumed samples: 1504256 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.258095E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:53.879643 | finish at 2025-09-10 11:42:51 + [2025-09-09 19:22:03] iteration 1470/ 11920 | consumed samples: 1505280 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.259376E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:38.207591 | finish at 2025-09-10 11:42:41 + [2025-09-09 19:22:09] iteration 1471/ 11920 | consumed samples: 1506304 | elapsed time 
per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.253332E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:16.179826 | finish at 2025-09-10 11:42:25 + [2025-09-09 19:22:14] iteration 1472/ 11920 | consumed samples: 1507328 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.242417E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:04.522732 | finish at 2025-09-10 11:42:19 + [2025-09-09 19:22:20] iteration 1473/ 11920 | consumed samples: 1508352 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.231119E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:31.899474 | finish at 2025-09-10 11:42:52 + [2025-09-09 19:22:26] iteration 1474/ 11920 | consumed samples: 1509376 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.235560E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:33.434857 | finish at 2025-09-10 11:43:59 + [2025-09-09 19:22:31] iteration 1475/ 11920 | consumed samples: 1510400 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.236427E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:19:35.537539 | finish at 2025-09-10 11:42:07 + [2025-09-09 19:22:37] iteration 1476/ 11920 | consumed samples: 1511424 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.235174E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:19:57.407945 | finish at 2025-09-10 11:42:34 + [2025-09-09 19:22:43] iteration 1477/ 11920 | consumed samples: 1512448 | elapsed time per iteration (ms): 5960.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.236347E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:17:24.119678 | finish at 2025-09-10 12:40:07 + [2025-09-09 19:22:48] iteration 1478/ 11920 | consumed samples: 1513472 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226856E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:29.733624 | finish at 2025-09-10 11:41:18 + [2025-09-09 19:22:54] iteration 1479/ 11920 | consumed samples: 1514496 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.245007E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:26.772255 | finish at 2025-09-10 11:41:21 + [2025-09-09 19:23:00] iteration 1480/ 11920 | consumed samples: 1515520 | elapsed time per iteration (ms): 5937.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.243094E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:13:10.539179 | finish at 2025-09-10 12:36:11 + [2025-09-09 19:23:06] iteration 1481/ 11920 | consumed samples: 1516544 | elapsed time per iteration (ms): 5913.1 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.247787E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:47.288982 | finish at 2025-09-10 12:31:53 + [2025-09-09 19:23:12] iteration 1482/ 11920 | consumed samples: 1517568 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226759E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:19:41.990261 | finish at 2025-09-10 11:42:54 + [2025-09-09 19:23:17] iteration 1483/ 11920 | consumed samples: 1518592 | elapsed time per iteration (ms): 5838.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.230769E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:55:33.699969 | finish at 2025-09-10 12:18:51 + [2025-09-09 19:23:23] iteration 1484/ 11920 | consumed samples: 1519616 | elapsed time per iteration (ms): 5969.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.240463E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:13.609693 | finish at 2025-09-10 12:41:37 + [2025-09-09 
19:23:29] iteration 1485/ 11920 | consumed samples: 1520640 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.229093E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:19.803256 | finish at 2025-09-10 11:41:49 + [2025-09-09 19:23:35] iteration 1486/ 11920 | consumed samples: 1521664 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.241761E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:59.030617 | finish at 2025-09-10 11:41:34 + [2025-09-09 19:23:40] iteration 1487/ 11920 | consumed samples: 1522688 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.238929E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:18.095359 | finish at 2025-09-10 11:40:58 + [2025-09-09 19:23:46] iteration 1488/ 11920 | consumed samples: 1523712 | elapsed time per iteration (ms): 6065.3 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221967E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:33.430832 | finish at 2025-09-10 12:58:20 + [2025-09-09 19:23:52] iteration 1489/ 11920 | consumed samples: 1524736 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.241010E+00 | loss scale: 1.0 | grad norm: 
0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:57.254866 | finish at 2025-09-10 11:40:49 + [2025-09-09 19:23:58] iteration 1490/ 11920 | consumed samples: 1525760 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.246545E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:19:00.580983 | finish at 2025-09-10 11:42:58 + [2025-09-09 19:24:03] iteration 1491/ 11920 | consumed samples: 1526784 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.225711E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:03.073329 | finish at 2025-09-10 11:44:06 + [2025-09-09 19:24:09] iteration 1492/ 11920 | consumed samples: 1527808 | elapsed time per iteration (ms): 5647.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.241828E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:33.455558 | finish at 2025-09-10 11:45:42 + [2025-09-09 19:24:14] iteration 1493/ 11920 | consumed samples: 1528832 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.228390E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:52.445953 | finish at 2025-09-10 11:43:07 + [2025-09-09 19:24:20] iteration 1494/ 11920 | consumed samples: 1529856 | elapsed time per iteration (ms): 
5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.233249E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:51.323781 | finish at 2025-09-10 11:42:11 + [2025-09-09 19:24:26] iteration 1495/ 11920 | consumed samples: 1530880 | elapsed time per iteration (ms): 6010.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.227976E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:24:14.428858 | finish at 2025-09-10 12:48:41 + [2025-09-09 19:24:32] iteration 1496/ 11920 | consumed samples: 1531904 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.237403E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:56.819727 | finish at 2025-09-10 11:42:29 + [2025-09-09 19:24:37] iteration 1497/ 11920 | consumed samples: 1532928 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.228650E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:59.185134 | finish at 2025-09-10 11:40:37 + [2025-09-09 19:24:43] iteration 1498/ 11920 | consumed samples: 1533952 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.229331E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:17:31.045511 | finish at 2025-09-10 11:42:14 + [2025-09-09 19:24:49] iteration 1499/ 11920 | consumed samples: 1534976 | elapsed time per iteration (ms): 5875.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220448E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:28.150324 | finish at 2025-09-10 12:25:17 + [2025-09-09 19:24:55] iteration 1500/ 11920 | consumed samples: 1536000 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.230075E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:43.862014 | finish at 2025-09-10 11:41:38 + [2025-09-09 19:25:00] iteration 1501/ 11920 | consumed samples: 1537024 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226804E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:07.862474 | finish at 2025-09-10 11:41:08 + [2025-09-09 19:25:06] iteration 1502/ 11920 | consumed samples: 1538048 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.229676E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:32.648200 | finish at 2025-09-10 11:42:38 + [2025-09-09 19:25:11] iteration 1503/ 11920 | consumed samples: 1539072 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.233713E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:20.392004 | finish at 2025-09-10 11:42:32 + [2025-09-09 19:25:17] iteration 1504/ 11920 | consumed samples: 1540096 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221217E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:32.037666 | finish at 2025-09-10 11:43:49 + [2025-09-09 19:25:23] iteration 1505/ 11920 | consumed samples: 1541120 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.234877E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:45.888692 | finish at 2025-09-10 11:43:09 + [2025-09-09 19:25:28] iteration 1506/ 11920 | consumed samples: 1542144 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212101E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:38.810823 | finish at 2025-09-10 11:43:07 + [2025-09-09 19:25:34] iteration 1507/ 11920 | consumed samples: 1543168 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.217241E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:38.190845 | finish at 2025-09-10 11:40:12 + [2025-09-09 19:25:40] iteration 
1508/ 11920 | consumed samples: 1544192 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.224487E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:01.487593 | finish at 2025-09-10 11:41:41 + [2025-09-09 19:25:45] iteration 1509/ 11920 | consumed samples: 1545216 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.231463E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:17.962827 | finish at 2025-09-10 11:41:03 + [2025-09-09 19:25:51] iteration 1510/ 11920 | consumed samples: 1546240 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222406E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:17.991772 | finish at 2025-09-10 11:42:09 + [2025-09-09 19:25:56] iteration 1511/ 11920 | consumed samples: 1547264 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226314E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:11.947721 | finish at 2025-09-10 11:41:08 + [2025-09-09 19:26:02] iteration 1512/ 11920 | consumed samples: 1548288 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222586E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:54.542168 | finish at 2025-09-10 11:43:57 + [2025-09-09 19:26:08] iteration 1513/ 11920 | consumed samples: 1549312 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.219489E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:03.352878 | finish at 2025-09-10 11:44:11 + [2025-09-09 19:26:13] iteration 1514/ 11920 | consumed samples: 1550336 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.204578E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:33.832094 | finish at 2025-09-10 11:43:47 + [2025-09-09 19:26:19] iteration 1515/ 11920 | consumed samples: 1551360 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220763E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:49.859504 | finish at 2025-09-10 11:41:09 + [2025-09-09 19:26:25] iteration 1516/ 11920 | consumed samples: 1552384 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.238282E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:29.892892 | finish at 2025-09-10 11:42:54 + [2025-09-09 19:26:30] iteration 1517/ 11920 | consumed samples: 1553408 | elapsed time per iteration (ms): 5627.9 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.231274E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:47.501355 | finish at 2025-09-10 11:42:18 + [2025-09-09 19:26:36] iteration 1518/ 11920 | consumed samples: 1554432 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.219990E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:45.561216 | finish at 2025-09-10 11:42:21 + [2025-09-09 19:26:41] iteration 1519/ 11920 | consumed samples: 1555456 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220237E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:57.638630 | finish at 2025-09-10 11:42:39 + [2025-09-09 19:26:47] iteration 1520/ 11920 | consumed samples: 1556480 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212209E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:58.889160 | finish at 2025-09-10 11:41:46 + [2025-09-09 19:26:53] iteration 1521/ 11920 | consumed samples: 1557504 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.217912E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:02.745749 
| finish at 2025-09-10 11:40:55 + [2025-09-09 19:26:58] iteration 1522/ 11920 | consumed samples: 1558528 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222444E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:06.095012 | finish at 2025-09-10 11:41:04 + [2025-09-09 19:27:04] iteration 1523/ 11920 | consumed samples: 1559552 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213238E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:04.673538 | finish at 2025-09-10 11:42:09 + [2025-09-09 19:27:10] iteration 1524/ 11920 | consumed samples: 1560576 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.225363E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:01.321820 | finish at 2025-09-10 11:42:11 + [2025-09-09 19:27:15] iteration 1525/ 11920 | consumed samples: 1561600 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.208344E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:20.758195 | finish at 2025-09-10 11:42:36 + [2025-09-09 19:27:21] iteration 1526/ 11920 | consumed samples: 1562624 | elapsed time per iteration (ms): 5642.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.211083E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:31.616057 | finish at 2025-09-10 11:44:52 + [2025-09-09 19:27:26] iteration 1527/ 11920 | consumed samples: 1563648 | elapsed time per iteration (ms): 5643.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.231130E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:28.518010 | finish at 2025-09-10 11:44:55 + [2025-09-09 19:27:32] iteration 1528/ 11920 | consumed samples: 1564672 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215486E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:14.900894 | finish at 2025-09-10 11:41:47 + [2025-09-09 19:27:38] iteration 1529/ 11920 | consumed samples: 1565696 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221914E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:26.900180 | finish at 2025-09-10 11:42:05 + [2025-09-09 19:27:43] iteration 1530/ 11920 | consumed samples: 1566720 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206706E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:20.567012 | finish at 2025-09-10 11:43:04 + [2025-09-09 19:27:49] iteration 1531/ 11920 | consumed samples: 
1567744 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220915E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:12.851662 | finish at 2025-09-10 11:41:02 + [2025-09-09 19:27:55] iteration 1532/ 11920 | consumed samples: 1568768 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.218407E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:51.566632 | finish at 2025-09-10 11:42:46 + [2025-09-09 19:28:00] iteration 1533/ 11920 | consumed samples: 1569792 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211423E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:00.473208 | finish at 2025-09-10 11:42:01 + [2025-09-09 19:28:06] iteration 1534/ 11920 | consumed samples: 1570816 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211900E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:00.376287 | finish at 2025-09-10 11:42:06 + [2025-09-09 19:28:12] iteration 1535/ 11920 | consumed samples: 1571840 | elapsed time per iteration (ms): 5819.7 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.216263E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:17.604336 | finish at 2025-09-10 12:15:29 + [2025-09-09 19:28:18] iteration 1536/ 11920 | consumed samples: 1572864 | elapsed time per iteration (ms): 5975.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.218726E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:12.781860 | finish at 2025-09-10 12:42:30 + [2025-09-09 19:28:24] iteration 1537/ 11920 | consumed samples: 1573888 | elapsed time per iteration (ms): 5934.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226656E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:54.123291 | finish at 2025-09-10 12:35:18 + [2025-09-09 19:28:30] iteration 1538/ 11920 | consumed samples: 1574912 | elapsed time per iteration (ms): 5953.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215975E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:13.232409 | finish at 2025-09-10 12:38:43 + [2025-09-09 19:28:35] iteration 1539/ 11920 | consumed samples: 1575936 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190507E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:48.703945 | finish at 2025-09-10 11:41:24 + [2025-09-09 19:28:41] iteration 1540/ 11920 | consumed samples: 1576960 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 
| MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215684E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:55.317950 | finish at 2025-09-10 11:40:36 + [2025-09-09 19:28:46] iteration 1541/ 11920 | consumed samples: 1577984 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213466E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:06.797528 | finish at 2025-09-10 11:43:53 + [2025-09-09 19:28:52] iteration 1542/ 11920 | consumed samples: 1579008 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.233528E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:28.393230 | finish at 2025-09-10 11:43:20 + [2025-09-09 19:28:58] iteration 1543/ 11920 | consumed samples: 1580032 | elapsed time per iteration (ms): 6099.2 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206449E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:34:51.047189 | finish at 2025-09-10 13:03:49 + [2025-09-09 19:29:04] iteration 1544/ 11920 | consumed samples: 1581056 | elapsed time per iteration (ms): 5927.2 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.216955E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:05:00.642862 | finish at 2025-09-10 
12:34:05 + [2025-09-09 19:29:10] iteration 1545/ 11920 | consumed samples: 1582080 | elapsed time per iteration (ms): 5913.1 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209541E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:27.974718 | finish at 2025-09-10 12:31:38 + [2025-09-09 19:29:16] iteration 1546/ 11920 | consumed samples: 1583104 | elapsed time per iteration (ms): 5860.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206455E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:15.732313 | finish at 2025-09-10 12:22:32 + [2025-09-09 19:29:22] iteration 1547/ 11920 | consumed samples: 1584128 | elapsed time per iteration (ms): 6230.6 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209481E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:57:09.624278 | finish at 2025-09-10 13:26:32 + [2025-09-09 19:29:28] iteration 1548/ 11920 | consumed samples: 1585152 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.202600E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:40.911900 | finish at 2025-09-10 11:44:09 + [2025-09-09 19:29:33] iteration 1549/ 11920 | consumed samples: 1586176 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206208E+00 | loss 
scale: 1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:31.609456 | finish at 2025-09-10 11:42:05 + [2025-09-09 19:29:39] iteration 1550/ 11920 | consumed samples: 1587200 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213679E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:26.346698 | finish at 2025-09-10 11:43:05 + [2025-09-09 19:29:45] iteration 1551/ 11920 | consumed samples: 1588224 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.219676E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:25.009846 | finish at 2025-09-10 11:41:10 + [2025-09-09 19:29:50] iteration 1552/ 11920 | consumed samples: 1589248 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209525E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:23.977661 | finish at 2025-09-10 11:42:14 + [2025-09-09 19:29:56] iteration 1553/ 11920 | consumed samples: 1590272 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209856E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:39.201323 | finish at 2025-09-10 11:41:35 + [2025-09-09 19:30:02] iteration 1554/ 11920 | consumed samples: 1591296 | elapsed time 
per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.208840E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:18.303563 | finish at 2025-09-10 11:42:20 + [2025-09-09 19:30:07] iteration 1555/ 11920 | consumed samples: 1592320 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.208845E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:12.510141 | finish at 2025-09-10 11:42:20 + [2025-09-09 19:30:13] iteration 1556/ 11920 | consumed samples: 1593344 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.219004E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:51.542621 | finish at 2025-09-10 11:43:04 + [2025-09-09 19:30:18] iteration 1557/ 11920 | consumed samples: 1594368 | elapsed time per iteration (ms): 5638.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221400E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:52.721533 | finish at 2025-09-10 11:44:11 + [2025-09-09 19:30:24] iteration 1558/ 11920 | consumed samples: 1595392 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.200951E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:13:27.121356 | finish at 2025-09-10 11:43:51 + [2025-09-09 19:30:30] iteration 1559/ 11920 | consumed samples: 1596416 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.216425E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:15.800613 | finish at 2025-09-10 11:42:45 + [2025-09-09 19:30:35] iteration 1560/ 11920 | consumed samples: 1597440 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.217217E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:14.170074 | finish at 2025-09-10 11:41:49 + [2025-09-09 19:30:41] iteration 1561/ 11920 | consumed samples: 1598464 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.202560E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:34.814502 | finish at 2025-09-10 11:43:16 + [2025-09-09 19:30:47] iteration 1562/ 11920 | consumed samples: 1599488 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.219946E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:16.949689 | finish at 2025-09-10 11:42:04 + [2025-09-09 19:30:52] iteration 1563/ 11920 | consumed samples: 1600512 | elapsed time per iteration (ms): 5648.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.232251E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:01.188197 | finish at 2025-09-10 11:45:53 + [2025-09-09 19:30:58] iteration 1564/ 11920 | consumed samples: 1601536 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.218112E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:20.370781 | finish at 2025-09-10 11:42:18 + [2025-09-09 19:31:03] iteration 1565/ 11920 | consumed samples: 1602560 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226701E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:42.971306 | finish at 2025-09-10 11:44:46 + [2025-09-09 19:31:09] iteration 1566/ 11920 | consumed samples: 1603584 | elapsed time per iteration (ms): 6022.7 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211903E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:19:18.962481 | finish at 2025-09-10 12:50:28 + [2025-09-09 19:31:16] iteration 1567/ 11920 | consumed samples: 1604608 | elapsed time per iteration (ms): 6299.9 | throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220730E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 18:07:02.926977 | finish at 2025-09-10 13:38:19 + [2025-09-09 
19:31:21] iteration 1568/ 11920 | consumed samples: 1605632 | elapsed time per iteration (ms): 5643.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.223869E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:36.334835 | finish at 2025-09-10 11:44:58 + [2025-09-09 19:31:27] iteration 1569/ 11920 | consumed samples: 1606656 | elapsed time per iteration (ms): 5964.3 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.204816E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:08:56.335178 | finish at 2025-09-10 12:40:24 + [2025-09-09 19:31:33] iteration 1570/ 11920 | consumed samples: 1607680 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.201458E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:33.255222 | finish at 2025-09-10 11:43:06 + [2025-09-09 19:31:39] iteration 1571/ 11920 | consumed samples: 1608704 | elapsed time per iteration (ms): 5849.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.217883E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:54.769919 | finish at 2025-09-10 12:20:34 + [2025-09-09 19:31:45] iteration 1572/ 11920 | consumed samples: 1609728 | elapsed time per iteration (ms): 5866.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209509E+00 | loss scale: 1.0 | grad norm: 
0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:47.184901 | finish at 2025-09-10 12:23:32 + [2025-09-09 19:31:51] iteration 1573/ 11920 | consumed samples: 1610752 | elapsed time per iteration (ms): 5969.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.204741E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:09:28.612727 | finish at 2025-09-10 12:41:19 + [2025-09-09 19:31:56] iteration 1574/ 11920 | consumed samples: 1611776 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212357E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:43.298919 | finish at 2025-09-10 11:43:40 + [2025-09-09 19:32:02] iteration 1575/ 11920 | consumed samples: 1612800 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.207801E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:51.500301 | finish at 2025-09-10 11:43:53 + [2025-09-09 19:32:08] iteration 1576/ 11920 | consumed samples: 1613824 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.196159E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:00.732124 | finish at 2025-09-10 11:43:08 + [2025-09-09 19:32:13] iteration 1577/ 11920 | consumed samples: 1614848 | elapsed time per iteration (ms): 
5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198828E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:15.578454 | finish at 2025-09-10 11:41:29 + [2025-09-09 19:32:19] iteration 1578/ 11920 | consumed samples: 1615872 | elapsed time per iteration (ms): 5860.5 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215866E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:09.148994 | finish at 2025-09-10 12:22:28 + [2025-09-09 19:32:25] iteration 1579/ 11920 | consumed samples: 1616896 | elapsed time per iteration (ms): 5938.7 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.201392E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:32.589391 | finish at 2025-09-10 12:35:58 + [2025-09-09 19:32:31] iteration 1580/ 11920 | consumed samples: 1617920 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193361E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:41.748657 | finish at 2025-09-10 11:42:12 + [2025-09-09 19:32:36] iteration 1581/ 11920 | consumed samples: 1618944 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.200620E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:09:53.362073 | finish at 2025-09-10 11:42:30 + [2025-09-09 19:32:42] iteration 1582/ 11920 | consumed samples: 1619968 | elapsed time per iteration (ms): 5871.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192781E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:51:40.651657 | finish at 2025-09-10 12:24:23 + [2025-09-09 19:32:48] iteration 1583/ 11920 | consumed samples: 1620992 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.194174E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:47.805480 | finish at 2025-09-10 11:42:36 + [2025-09-09 19:32:54] iteration 1584/ 11920 | consumed samples: 1622016 | elapsed time per iteration (ms): 5984.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.199697E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:10:52.848923 | finish at 2025-09-10 12:43:47 + [2025-09-09 19:32:59] iteration 1585/ 11920 | consumed samples: 1623040 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.194433E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:21.119864 | finish at 2025-09-10 11:42:21 + [2025-09-09 19:33:05] iteration 1586/ 11920 | consumed samples: 1624064 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.216485E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:18.579440 | finish at 2025-09-10 11:42:24 + [2025-09-09 19:33:11] iteration 1587/ 11920 | consumed samples: 1625088 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195849E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:42.993959 | finish at 2025-09-10 11:40:54 + [2025-09-09 19:33:16] iteration 1588/ 11920 | consumed samples: 1626112 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.179954E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:16.341548 | finish at 2025-09-10 11:43:33 + [2025-09-09 19:33:22] iteration 1589/ 11920 | consumed samples: 1627136 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203203E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:21.757749 | finish at 2025-09-10 11:42:44 + [2025-09-09 19:33:28] iteration 1590/ 11920 | consumed samples: 1628160 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.186444E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:45.655487 | finish at 2025-09-10 11:44:13 + [2025-09-09 19:33:33] iteration 
1591/ 11920 | consumed samples: 1629184 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.199689E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:07.948682 | finish at 2025-09-10 11:43:41 + [2025-09-09 19:33:39] iteration 1592/ 11920 | consumed samples: 1630208 | elapsed time per iteration (ms): 5922.4 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.207958E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:26.335411 | finish at 2025-09-10 12:33:05 + [2025-09-09 19:33:45] iteration 1593/ 11920 | consumed samples: 1631232 | elapsed time per iteration (ms): 5969.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.223322E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:23.690958 | finish at 2025-09-10 12:41:09 + [2025-09-09 19:33:51] iteration 1594/ 11920 | consumed samples: 1632256 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.197664E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:40.063962 | finish at 2025-09-10 11:41:31 + [2025-09-09 19:33:57] iteration 1595/ 11920 | consumed samples: 1633280 | elapsed time per iteration (ms): 5828.9 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.207896E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:03.560914 | finish at 2025-09-10 12:17:00 + [2025-09-09 19:34:02] iteration 1596/ 11920 | consumed samples: 1634304 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.200268E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:08:44.758693 | finish at 2025-09-10 11:42:47 + [2025-09-09 19:34:08] iteration 1597/ 11920 | consumed samples: 1635328 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203557E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:08:47.979088 | finish at 2025-09-10 11:42:56 + [2025-09-09 19:34:13] iteration 1598/ 11920 | consumed samples: 1636352 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211342E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:08.609036 | finish at 2025-09-10 11:43:22 + [2025-09-09 19:34:19] iteration 1599/ 11920 | consumed samples: 1637376 | elapsed time per iteration (ms): 5857.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.208961E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:31.077527 | finish at 2025-09-10 12:21:50 + [2025-09-09 19:34:25] iteration 1600/ 11920 | consumed samples: 1638400 | elapsed time per iteration (ms): 5632.3 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.201783E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:08:44.941292 | finish at 2025-09-10 11:43:10 + [2025-09-09 19:34:31] iteration 1601/ 11920 | consumed samples: 1639424 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.178886E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:34.708014 | finish at 2025-09-10 11:42:05 + [2025-09-09 19:34:36] iteration 1602/ 11920 | consumed samples: 1640448 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213709E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:21.570565 | finish at 2025-09-10 11:43:58 + [2025-09-09 19:34:42] iteration 1603/ 11920 | consumed samples: 1641472 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211434E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:55.127938 | finish at 2025-09-10 11:42:37 + [2025-09-09 19:34:47] iteration 1604/ 11920 | consumed samples: 1642496 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.202276E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:57.762875 
| finish at 2025-09-10 11:42:45 + [2025-09-09 19:34:53] iteration 1605/ 11920 | consumed samples: 1643520 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195989E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:40.498112 | finish at 2025-09-10 11:42:34 + [2025-09-09 19:34:59] iteration 1606/ 11920 | consumed samples: 1644544 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.194749E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:26.670898 | finish at 2025-09-10 11:42:25 + [2025-09-09 19:35:04] iteration 1607/ 11920 | consumed samples: 1645568 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198530E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:08:04.362277 | finish at 2025-09-10 11:43:09 + [2025-09-09 19:35:10] iteration 1608/ 11920 | consumed samples: 1646592 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.186491E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:37.379885 | finish at 2025-09-10 11:42:47 + [2025-09-09 19:35:16] iteration 1609/ 11920 | consumed samples: 1647616 | elapsed time per iteration (ms): 5648.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.209130E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:38.430770 | finish at 2025-09-10 11:45:54 + [2025-09-09 19:35:21] iteration 1610/ 11920 | consumed samples: 1648640 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195720E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:40.745740 | finish at 2025-09-10 11:42:02 + [2025-09-09 19:35:27] iteration 1611/ 11920 | consumed samples: 1649664 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.187244E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:52.682700 | finish at 2025-09-10 11:41:20 + [2025-09-09 19:35:33] iteration 1612/ 11920 | consumed samples: 1650688 | elapsed time per iteration (ms): 5934.5 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193872E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:32.727479 | finish at 2025-09-10 12:35:06 + [2025-09-09 19:35:38] iteration 1613/ 11920 | consumed samples: 1651712 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195892E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:51.782087 | finish at 2025-09-10 11:42:30 + [2025-09-09 19:35:44] iteration 1614/ 11920 | consumed samples: 
1652736 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.199022E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:47.286443 | finish at 2025-09-10 11:42:31 + [2025-09-09 19:35:50] iteration 1615/ 11920 | consumed samples: 1653760 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190722E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:13.801575 | finish at 2025-09-10 11:42:03 + [2025-09-09 19:35:55] iteration 1616/ 11920 | consumed samples: 1654784 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192497E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:26.923462 | finish at 2025-09-10 11:41:22 + [2025-09-09 19:36:01] iteration 1617/ 11920 | consumed samples: 1655808 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195044E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:54.035336 | finish at 2025-09-10 11:40:55 + [2025-09-09 19:36:07] iteration 1618/ 11920 | consumed samples: 1656832 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.184026E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:46.350304 | finish at 2025-09-10 11:41:53 + [2025-09-09 19:36:12] iteration 1619/ 11920 | consumed samples: 1657856 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.180420E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:55.493163 | finish at 2025-09-10 11:42:08 + [2025-09-09 19:36:18] iteration 1620/ 11920 | consumed samples: 1658880 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.197006E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:37.009254 | finish at 2025-09-10 11:42:55 + [2025-09-09 19:36:23] iteration 1621/ 11920 | consumed samples: 1659904 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.188310E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:13.669605 | finish at 2025-09-10 11:42:37 + [2025-09-09 19:36:29] iteration 1622/ 11920 | consumed samples: 1660928 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193493E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:54.573586 | finish at 2025-09-10 11:42:24 + [2025-09-09 19:36:35] iteration 1623/ 11920 | consumed samples: 1661952 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.185864E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:48.344361 | finish at 2025-09-10 11:42:23 + [2025-09-09 19:36:40] iteration 1624/ 11920 | consumed samples: 1662976 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193763E+00 | loss scale: 1.0 | grad norm: 0.105 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:51.390970 | finish at 2025-09-10 11:43:32 + [2025-09-09 19:36:46] iteration 1625/ 11920 | consumed samples: 1664000 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.184207E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:43.605726 | finish at 2025-09-10 11:42:30 + [2025-09-09 19:36:52] iteration 1626/ 11920 | consumed samples: 1665024 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.183297E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:34.728129 | finish at 2025-09-10 11:41:26 + [2025-09-09 19:36:57] iteration 1627/ 11920 | consumed samples: 1666048 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.178362E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:49.511311 | finish at 2025-09-10 
11:41:47 + [2025-09-09 19:37:03] iteration 1628/ 11920 | consumed samples: 1667072 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.183099E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:49.309639 | finish at 2025-09-10 11:40:52 + [2025-09-09 19:37:08] iteration 1629/ 11920 | consumed samples: 1668096 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190728E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:24.599078 | finish at 2025-09-10 11:41:33 + [2025-09-09 19:37:14] iteration 1630/ 11920 | consumed samples: 1669120 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.189128E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:15.291345 | finish at 2025-09-10 11:41:29 + [2025-09-09 19:37:20] iteration 1631/ 11920 | consumed samples: 1670144 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192927E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:30.638067 | finish at 2025-09-10 11:43:50 + [2025-09-09 19:37:25] iteration 1632/ 11920 | consumed samples: 1671168 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.186960E+00 | loss 
scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:26.332161 | finish at 2025-09-10 11:42:52 + [2025-09-09 19:37:31] iteration 1633/ 11920 | consumed samples: 1672192 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192050E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:15.281413 | finish at 2025-09-10 11:42:46 + [2025-09-09 19:37:37] iteration 1634/ 11920 | consumed samples: 1673216 | elapsed time per iteration (ms): 5646.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.177504E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:59.066331 | finish at 2025-09-10 11:45:36 + [2025-09-09 19:37:43] iteration 1635/ 11920 | consumed samples: 1674240 | elapsed time per iteration (ms): 5961.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.191137E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:50.915015 | finish at 2025-09-10 12:39:33 + [2025-09-09 19:37:48] iteration 1636/ 11920 | consumed samples: 1675264 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.175858E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:52.096822 | finish at 2025-09-10 11:40:40 + [2025-09-09 19:37:54] iteration 1637/ 11920 | consumed samples: 1676288 | elapsed time 
per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.188169E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:26.725576 | finish at 2025-09-10 11:41:21 + [2025-09-09 19:38:00] iteration 1638/ 11920 | consumed samples: 1677312 | elapsed time per iteration (ms): 5973.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.165142E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:39.345732 | finish at 2025-09-10 12:41:39 + [2025-09-09 19:38:05] iteration 1639/ 11920 | consumed samples: 1678336 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198407E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:22.492791 | finish at 2025-09-10 11:41:28 + [2025-09-09 19:38:11] iteration 1640/ 11920 | consumed samples: 1679360 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181286E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:25.221806 | finish at 2025-09-10 11:40:36 + [2025-09-09 19:38:17] iteration 1641/ 11920 | consumed samples: 1680384 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.182454E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:04:24.703232 | finish at 2025-09-10 11:42:41 + [2025-09-09 19:38:22] iteration 1642/ 11920 | consumed samples: 1681408 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.204913E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:59.194968 | finish at 2025-09-10 11:44:21 + [2025-09-09 19:38:28] iteration 1643/ 11920 | consumed samples: 1682432 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.197590E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:28.848994 | finish at 2025-09-10 11:42:57 + [2025-09-09 19:38:34] iteration 1644/ 11920 | consumed samples: 1683456 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.187175E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:23.433684 | finish at 2025-09-10 11:42:57 + [2025-09-09 19:38:39] iteration 1645/ 11920 | consumed samples: 1684480 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190068E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:10.109836 | finish at 2025-09-10 11:43:49 + [2025-09-09 19:38:45] iteration 1646/ 11920 | consumed samples: 1685504 | elapsed time per iteration (ms): 5994.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.184984E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:27.656314 | finish at 2025-09-10 12:45:13 + [2025-09-09 19:38:51] iteration 1647/ 11920 | consumed samples: 1686528 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.172064E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:39.503625 | finish at 2025-09-10 11:40:30 + [2025-09-09 19:38:56] iteration 1648/ 11920 | consumed samples: 1687552 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209211E+00 | loss scale: 1.0 | grad norm: 2.032 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:28.316826 | finish at 2025-09-10 11:41:25 + [2025-09-09 19:39:02] iteration 1649/ 11920 | consumed samples: 1688576 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.240149E+00 | loss scale: 1.0 | grad norm: 0.366 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:24.431967 | finish at 2025-09-10 11:43:26 + [2025-09-09 19:39:08] iteration 1650/ 11920 | consumed samples: 1689600 | elapsed time per iteration (ms): 5649.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.236718E+00 | loss scale: 1.0 | grad norm: 0.463 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:59.876640 | finish at 2025-09-10 11:46:08 + [2025-09-09 
19:39:13] iteration 1651/ 11920 | consumed samples: 1690624 | elapsed time per iteration (ms): 5657.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.273924E+00 | loss scale: 1.0 | grad norm: 0.556 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:08:18.708931 | finish at 2025-09-10 11:47:32 + [2025-09-09 19:39:19] iteration 1652/ 11920 | consumed samples: 1691648 | elapsed time per iteration (ms): 5670.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.300025E+00 | loss scale: 1.0 | grad norm: 1.040 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:19.876586 | finish at 2025-09-10 11:49:39 + [2025-09-09 19:39:25] iteration 1653/ 11920 | consumed samples: 1692672 | elapsed time per iteration (ms): 5684.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.356370E+00 | loss scale: 1.0 | grad norm: 0.959 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:37.407851 | finish at 2025-09-10 11:52:02 + [2025-09-09 19:39:30] iteration 1654/ 11920 | consumed samples: 1693696 | elapsed time per iteration (ms): 5680.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.329900E+00 | loss scale: 1.0 | grad norm: 0.485 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:54.853148 | finish at 2025-09-10 11:51:25 + [2025-09-09 19:39:36] iteration 1655/ 11920 | consumed samples: 1694720 | elapsed time per iteration (ms): 5698.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.492573E+00 | loss scale: 1.0 | grad norm: 
1.423 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:54.093343 | finish at 2025-09-10 11:54:30 + [2025-09-09 19:39:42] iteration 1656/ 11920 | consumed samples: 1695744 | elapsed time per iteration (ms): 5725.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.453996E+00 | loss scale: 1.0 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:19:22.886875 | finish at 2025-09-10 11:59:05 + [2025-09-09 19:39:48] iteration 1657/ 11920 | consumed samples: 1696768 | elapsed time per iteration (ms): 5707.3 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.448791E+00 | loss scale: 1.0 | grad norm: 0.742 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:13.706162 | finish at 2025-09-10 11:56:01 + [2025-09-09 19:39:53] iteration 1658/ 11920 | consumed samples: 1697792 | elapsed time per iteration (ms): 5731.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.660897E+00 | loss scale: 1.0 | grad norm: 5.101 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:20.848087 | finish at 2025-09-10 12:00:14 + [2025-09-09 19:39:59] iteration 1659/ 11920 | consumed samples: 1698816 | elapsed time per iteration (ms): 5721.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.612027E+00 | loss scale: 1.0 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:23.912027 | finish at 2025-09-10 11:58:23 + [2025-09-09 19:40:05] iteration 1660/ 11920 | consumed samples: 1699840 | elapsed time per iteration (ms): 
5688.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.538832E+00 | loss scale: 1.0 | grad norm: 0.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:39.500957 | finish at 2025-09-10 11:52:44 + [2025-09-09 19:40:10] iteration 1661/ 11920 | consumed samples: 1700864 | elapsed time per iteration (ms): 5717.4 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.609251E+00 | loss scale: 1.0 | grad norm: 1.014 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:35.154682 | finish at 2025-09-10 11:57:46 + [2025-09-09 19:40:16] iteration 1662/ 11920 | consumed samples: 1701888 | elapsed time per iteration (ms): 5709.7 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.667090E+00 | loss scale: 1.0 | grad norm: 1.268 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:09.893373 | finish at 2025-09-10 11:56:26 + [2025-09-09 19:40:22] iteration 1663/ 11920 | consumed samples: 1702912 | elapsed time per iteration (ms): 5724.0 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.700556E+00 | loss scale: 1.0 | grad norm: 1.037 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:30.730292 | finish at 2025-09-10 11:58:53 + [2025-09-09 19:40:28] iteration 1664/ 11920 | consumed samples: 1703936 | elapsed time per iteration (ms): 5736.3 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.714142E+00 | loss scale: 1.0 | grad norm: 1.353 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:20:31.363117 | finish at 2025-09-10 12:00:59 + [2025-09-09 19:40:33] iteration 1665/ 11920 | consumed samples: 1704960 | elapsed time per iteration (ms): 5757.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.736385E+00 | loss scale: 1.0 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:05.159366 | finish at 2025-09-10 12:04:38 + [2025-09-09 19:40:39] iteration 1666/ 11920 | consumed samples: 1705984 | elapsed time per iteration (ms): 5710.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.671070E+00 | loss scale: 1.0 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:55.540362 | finish at 2025-09-10 11:56:35 + [2025-09-09 19:40:45] iteration 1667/ 11920 | consumed samples: 1707008 | elapsed time per iteration (ms): 5714.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.730369E+00 | loss scale: 1.0 | grad norm: 1.103 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:28.196372 | finish at 2025-09-10 11:57:13 + [2025-09-09 19:40:50] iteration 1668/ 11920 | consumed samples: 1708032 | elapsed time per iteration (ms): 5701.2 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.688796E+00 | loss scale: 1.0 | grad norm: 0.692 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:14:08.328513 | finish at 2025-09-10 11:54:59 + [2025-09-09 19:40:56] iteration 1669/ 11920 | consumed samples: 1709056 | elapsed time per iteration (ms): 5693.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.651572E+00 | loss scale: 1.0 | grad norm: 0.682 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:44.110479 | finish at 2025-09-10 11:53:40 + [2025-09-09 19:41:02] iteration 1670/ 11920 | consumed samples: 1710080 | elapsed time per iteration (ms): 5743.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.082676E+00 | loss scale: 1.0 | grad norm: 3.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:05.713656 | finish at 2025-09-10 12:02:08 + [2025-09-09 19:41:08] iteration 1671/ 11920 | consumed samples: 1711104 | elapsed time per iteration (ms): 5696.0 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.782240E+00 | loss scale: 1.0 | grad norm: 0.810 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:57.931153 | finish at 2025-09-10 11:54:06 + [2025-09-09 19:41:13] iteration 1672/ 11920 | consumed samples: 1712128 | elapsed time per iteration (ms): 5745.4 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.771078E+00 | loss scale: 1.0 | grad norm: 1.366 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:21:18.421354 | finish at 2025-09-10 12:02:32 + [2025-09-09 19:41:19] iteration 1673/ 11920 | consumed samples: 1713152 | elapsed time per iteration (ms): 6057.2 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.881419E+00 | loss scale: 1.0 | grad norm: 1.112 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:14:28.484628 | finish at 2025-09-10 12:55:48 + [2025-09-09 19:41:25] iteration 
1674/ 11920 | consumed samples: 1714176 | elapsed time per iteration (ms): 5714.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.700835E+00 | loss scale: 1.0 | grad norm: 0.420 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:47.358735 | finish at 2025-09-10 11:57:12 + [2025-09-09 19:41:31] iteration 1675/ 11920 | consumed samples: 1715200 | elapsed time per iteration (ms): 5694.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.665254E+00 | loss scale: 1.0 | grad norm: 0.424 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:18.686628 | finish at 2025-09-10 11:53:49 + [2025-09-09 19:41:36] iteration 1676/ 11920 | consumed samples: 1716224 | elapsed time per iteration (ms): 5685.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.657051E+00 | loss scale: 1.0 | grad norm: 0.542 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:43.118311 | finish at 2025-09-10 11:52:20 + [2025-09-09 19:41:42] iteration 1677/ 11920 | consumed samples: 1717248 | elapsed time per iteration (ms): 5938.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.754942E+00 | loss scale: 1.0 | grad norm: 1.428 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:53:46.736012 | finish at 2025-09-10 12:35:29 + [2025-09-09 19:41:48] iteration 1678/ 11920 | consumed samples: 1718272 | elapsed time per iteration (ms): 5688.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.670702E+00 | loss scale: 1.0 | grad norm: 0.492 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:01.447768 | finish at 2025-09-10 11:52:50 + [2025-09-09 19:41:54] iteration 1679/ 11920 | consumed samples: 1719296 | elapsed time per iteration (ms): 5888.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.670576E+00 | loss scale: 1.0 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:45:02.297013 | finish at 2025-09-10 12:26:56 + [2025-09-09 19:42:00] iteration 1680/ 11920 | consumed samples: 1720320 | elapsed time per iteration (ms): 5696.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.737150E+00 | loss scale: 1.0 | grad norm: 1.434 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:08.581543 | finish at 2025-09-10 11:54:08 + [2025-09-09 19:42:05] iteration 1681/ 11920 | consumed samples: 1721344 | elapsed time per iteration (ms): 5661.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.710065E+00 | loss scale: 1.0 | grad norm: 0.595 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:09.777789 | finish at 2025-09-10 11:48:15 + [2025-09-09 19:42:11] iteration 1682/ 11920 | consumed samples: 1722368 | elapsed time per iteration (ms): 5881.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.620111E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:36.688779 | finish at 2025-09-10 12:25:48 + [2025-09-09 19:42:17] iteration 1683/ 11920 | consumed samples: 1723392 | elapsed time per iteration (ms): 5657.3 | throughput per 
GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.574252E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:14.097343 | finish at 2025-09-10 11:47:31 + [2025-09-09 19:42:23] iteration 1684/ 11920 | consumed samples: 1724416 | elapsed time per iteration (ms): 5875.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.559335E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:42:20.617043 | finish at 2025-09-10 12:24:43 + [2025-09-09 19:42:29] iteration 1685/ 11920 | consumed samples: 1725440 | elapsed time per iteration (ms): 5964.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.542360E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:26.583786 | finish at 2025-09-10 12:39:55 + [2025-09-09 19:42:34] iteration 1686/ 11920 | consumed samples: 1726464 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.537255E+00 | loss scale: 1.0 | grad norm: 0.613 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:04.056874 | finish at 2025-09-10 11:45:38 + [2025-09-09 19:42:40] iteration 1687/ 11920 | consumed samples: 1727488 | elapsed time per iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.515953E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:56.800825 
| finish at 2025-09-10 11:46:37 + [2025-09-09 19:42:46] iteration 1688/ 11920 | consumed samples: 1728512 | elapsed time per iteration (ms): 5981.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.491338E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:01.114090 | finish at 2025-09-10 12:42:47 + [2025-09-09 19:42:52] iteration 1689/ 11920 | consumed samples: 1729536 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.468536E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:00:42.359610 | finish at 2025-09-10 11:43:34 + [2025-09-09 19:42:57] iteration 1690/ 11920 | consumed samples: 1730560 | elapsed time per iteration (ms): 5840.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.452371E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:52.569845 | finish at 2025-09-10 12:18:50 + [2025-09-09 19:43:03] iteration 1691/ 11920 | consumed samples: 1731584 | elapsed time per iteration (ms): 5840.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.478822E+00 | loss scale: 1.0 | grad norm: 0.545 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:38.363901 | finish at 2025-09-10 12:18:42 + [2025-09-09 19:43:09] iteration 1692/ 11920 | consumed samples: 1732608 | elapsed time per iteration (ms): 5647.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.470345E+00 | loss scale: 1.0 | grad norm: 0.539 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:44.069132 | finish at 2025-09-10 11:45:53 + [2025-09-09 19:43:15] iteration 1693/ 11920 | consumed samples: 1733632 | elapsed time per iteration (ms): 5990.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.424763E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:01:04.672545 | finish at 2025-09-10 12:44:20 + [2025-09-09 19:43:21] iteration 1694/ 11920 | consumed samples: 1734656 | elapsed time per iteration (ms): 5658.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.412545E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:20.987516 | finish at 2025-09-10 11:47:42 + [2025-09-09 19:43:27] iteration 1695/ 11920 | consumed samples: 1735680 | elapsed time per iteration (ms): 5920.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.414773E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:59.504421 | finish at 2025-09-10 12:32:26 + [2025-09-09 19:43:32] iteration 1696/ 11920 | consumed samples: 1736704 | elapsed time per iteration (ms): 5644.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.390160E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:47.410789 | finish at 2025-09-10 11:45:20 + [2025-09-09 19:43:38] iteration 1697/ 11920 | consumed samples: 
1737728 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.391319E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:44.235520 | finish at 2025-09-10 11:45:22 + [2025-09-09 19:43:43] iteration 1698/ 11920 | consumed samples: 1738752 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.375232E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:00:15.497543 | finish at 2025-09-10 11:43:59 + [2025-09-09 19:43:49] iteration 1699/ 11920 | consumed samples: 1739776 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.364478E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:47.975536 | finish at 2025-09-10 11:43:37 + [2025-09-09 19:43:55] iteration 1700/ 11920 | consumed samples: 1740800 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.353209E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:03.245401 | finish at 2025-09-10 11:42:58 + [2025-09-09 19:44:00] iteration 1701/ 11920 | consumed samples: 1741824 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.349971E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:47.519717 | finish at 2025-09-10 11:43:48 + [2025-09-09 19:44:06] iteration 1702/ 11920 | consumed samples: 1742848 | elapsed time per iteration (ms): 5934.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.339322E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:41.807402 | finish at 2025-09-10 12:34:48 + [2025-09-09 19:44:12] iteration 1703/ 11920 | consumed samples: 1743872 | elapsed time per iteration (ms): 5668.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.352868E+00 | loss scale: 1.0 | grad norm: 0.536 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:05:12.917907 | finish at 2025-09-10 11:49:25 + [2025-09-09 19:44:18] iteration 1704/ 11920 | consumed samples: 1744896 | elapsed time per iteration (ms): 5644.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.342238E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:06.733109 | finish at 2025-09-10 11:45:24 + [2025-09-09 19:44:24] iteration 1705/ 11920 | consumed samples: 1745920 | elapsed time per iteration (ms): 6101.1 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.321538E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:18:43.142892 | finish at 2025-09-10 13:03:07 + [2025-09-09 19:44:30] iteration 1706/ 11920 | consumed samples: 1746944 | elapsed time per iteration (ms): 6130.3 | throughput per GPU (TFLOP/s/GPU): 73.6 
| MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.321729E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:23:34.855437 | finish at 2025-09-10 13:08:05 + [2025-09-09 19:44:36] iteration 1707/ 11920 | consumed samples: 1747968 | elapsed time per iteration (ms): 5915.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.318272E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:49.541540 | finish at 2025-09-10 12:31:25 + [2025-09-09 19:44:41] iteration 1708/ 11920 | consumed samples: 1748992 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.303483E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:18.891921 | finish at 2025-09-10 11:47:00 + [2025-09-09 19:44:47] iteration 1709/ 11920 | consumed samples: 1750016 | elapsed time per iteration (ms): 6010.6 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.313682E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:02:54.400630 | finish at 2025-09-10 12:47:42 + [2025-09-09 19:44:53] iteration 1710/ 11920 | consumed samples: 1751040 | elapsed time per iteration (ms): 5640.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.300975E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:51.204810 | finish at 2025-09-10 
11:44:44 + [2025-09-09 19:44:59] iteration 1711/ 11920 | consumed samples: 1752064 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.297213E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:56.511434 | finish at 2025-09-10 11:43:55 + [2025-09-09 19:45:04] iteration 1712/ 11920 | consumed samples: 1753088 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.293140E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:28.723236 | finish at 2025-09-10 11:44:33 + [2025-09-09 19:45:10] iteration 1713/ 11920 | consumed samples: 1754112 | elapsed time per iteration (ms): 5643.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.288099E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:00:02.042184 | finish at 2025-09-10 11:45:12 + [2025-09-09 19:45:16] iteration 1714/ 11920 | consumed samples: 1755136 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.280483E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:21.210846 | finish at 2025-09-10 11:44:37 + [2025-09-09 19:45:21] iteration 1715/ 11920 | consumed samples: 1756160 | elapsed time per iteration (ms): 5663.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.278282E+00 | loss 
scale: 1.0 | grad norm: 0.123 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:11.334577 | finish at 2025-09-10 11:48:33 + [2025-09-09 19:45:27] iteration 1716/ 11920 | consumed samples: 1757184 | elapsed time per iteration (ms): 5643.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.292812E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:44.949026 | finish at 2025-09-10 11:45:12 + [2025-09-09 19:45:33] iteration 1717/ 11920 | consumed samples: 1758208 | elapsed time per iteration (ms): 5651.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.268887E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:00.617234 | finish at 2025-09-10 11:46:33 + [2025-09-09 19:45:38] iteration 1718/ 11920 | consumed samples: 1759232 | elapsed time per iteration (ms): 5647.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.264101E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:00:17.318038 | finish at 2025-09-10 11:45:56 + [2025-09-09 19:45:44] iteration 1719/ 11920 | consumed samples: 1760256 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.271264E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:12.737879 | finish at 2025-09-10 11:43:57 + [2025-09-09 19:45:50] iteration 1720/ 11920 | consumed samples: 1761280 | elapsed time 
per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.256536E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:49.097843 | finish at 2025-09-10 11:44:39 + [2025-09-09 19:45:55] iteration 1721/ 11920 | consumed samples: 1762304 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.270735E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:23.425959 | finish at 2025-09-10 11:44:19 + [2025-09-09 19:46:01] iteration 1722/ 11920 | consumed samples: 1763328 | elapsed time per iteration (ms): 5642.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.245561E+00 | loss scale: 1.0 | grad norm: 0.105 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:05.105148 | finish at 2025-09-10 11:45:06 + [2025-09-09 19:46:07] iteration 1723/ 11920 | consumed samples: 1764352 | elapsed time per iteration (ms): 5894.0 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.234702E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:41:40.911896 | finish at 2025-09-10 12:27:48 + [2025-09-09 19:46:12] iteration 1724/ 11920 | consumed samples: 1765376 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.243899E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:58:49.067141 | finish at 2025-09-10 11:45:01 + [2025-09-09 19:46:18] iteration 1725/ 11920 | consumed samples: 1766400 | elapsed time per iteration (ms): 6001.9 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.256580E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:59:48.886364 | finish at 2025-09-10 12:46:07 + [2025-09-09 19:46:24] iteration 1726/ 11920 | consumed samples: 1767424 | elapsed time per iteration (ms): 5658.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.232513E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:24.231121 | finish at 2025-09-10 11:47:48 + [2025-09-09 19:46:30] iteration 1727/ 11920 | consumed samples: 1768448 | elapsed time per iteration (ms): 5643.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226759E+00 | loss scale: 1.0 | grad norm: 0.078 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:42.760165 | finish at 2025-09-10 11:45:12 + [2025-09-09 19:46:36] iteration 1728/ 11920 | consumed samples: 1769472 | elapsed time per iteration (ms): 5860.2 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.247328E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:27.116360 | finish at 2025-09-10 12:22:03 + [2025-09-09 19:46:41] iteration 1729/ 11920 | consumed samples: 1770496 | elapsed time per iteration (ms): 5864.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.239326E+00 | loss scale: 1.0 | grad norm: 0.090 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:59.514594 | finish at 2025-09-10 12:22:41 + [2025-09-09 19:46:47] iteration 1730/ 11920 | consumed samples: 1771520 | elapsed time per iteration (ms): 5651.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.242324E+00 | loss scale: 1.0 | grad norm: 0.086 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:53.087482 | finish at 2025-09-10 11:46:40 + [2025-09-09 19:46:53] iteration 1731/ 11920 | consumed samples: 1772544 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.238666E+00 | loss scale: 1.0 | grad norm: 0.095 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:28.924760 | finish at 2025-09-10 11:44:22 + [2025-09-09 19:46:58] iteration 1732/ 11920 | consumed samples: 1773568 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.229776E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:36.213615 | finish at 2025-09-10 11:44:35 + [2025-09-09 19:47:04] iteration 1733/ 11920 | consumed samples: 1774592 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.228181E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:17.706395 | finish at 2025-09-10 11:44:22 + [2025-09-09 
19:47:10] iteration 1734/ 11920 | consumed samples: 1775616 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.227921E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:29.155210 | finish at 2025-09-10 11:44:39 + [2025-09-09 19:47:15] iteration 1735/ 11920 | consumed samples: 1776640 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.235160E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:34.583359 | finish at 2025-09-10 11:44:50 + [2025-09-09 19:47:21] iteration 1736/ 11920 | consumed samples: 1777664 | elapsed time per iteration (ms): 5649.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.239409E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:56.235687 | finish at 2025-09-10 11:46:17 + [2025-09-09 19:47:27] iteration 1737/ 11920 | consumed samples: 1778688 | elapsed time per iteration (ms): 5645.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.233344E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:09.286433 | finish at 2025-09-10 11:45:36 + [2025-09-09 19:47:32] iteration 1738/ 11920 | consumed samples: 1779712 | elapsed time per iteration (ms): 5654.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.224444E+00 | loss scale: 1.0 | grad norm: 
0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:35.485804 | finish at 2025-09-10 11:47:08 + [2025-09-09 19:47:38] iteration 1739/ 11920 | consumed samples: 1780736 | elapsed time per iteration (ms): 5650.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213732E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:49.432958 | finish at 2025-09-10 11:46:27 + [2025-09-09 19:47:43] iteration 1740/ 11920 | consumed samples: 1781760 | elapsed time per iteration (ms): 5653.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221079E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:16.999598 | finish at 2025-09-10 11:47:00 + [2025-09-09 19:47:49] iteration 1741/ 11920 | consumed samples: 1782784 | elapsed time per iteration (ms): 5644.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.235599E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:33.268865 | finish at 2025-09-10 11:45:22 + [2025-09-09 19:47:55] iteration 1742/ 11920 | consumed samples: 1783808 | elapsed time per iteration (ms): 5944.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.223797E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:24.826606 | finish at 2025-09-10 12:36:20 + [2025-09-09 19:48:01] iteration 1743/ 11920 | consumed samples: 1784832 | elapsed time per iteration (ms): 
5979.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.226814E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:54:14.594961 | finish at 2025-09-10 12:42:16 + [2025-09-09 19:48:07] iteration 1744/ 11920 | consumed samples: 1785856 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222900E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:56:32.146133 | finish at 2025-09-10 11:44:39 + [2025-09-09 19:48:13] iteration 1745/ 11920 | consumed samples: 1786880 | elapsed time per iteration (ms): 5860.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.223073E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:52.961029 | finish at 2025-09-10 12:22:06 + [2025-09-09 19:48:18] iteration 1746/ 11920 | consumed samples: 1787904 | elapsed time per iteration (ms): 5653.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.201302E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:41.955362 | finish at 2025-09-10 11:47:00 + [2025-09-09 19:48:24] iteration 1747/ 11920 | consumed samples: 1788928 | elapsed time per iteration (ms): 5916.2 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212758E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 16:43:05.943241 | finish at 2025-09-10 12:31:30 + [2025-09-09 19:48:30] iteration 1748/ 11920 | consumed samples: 1789952 | elapsed time per iteration (ms): 5985.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.205925E+00 | loss scale: 1.0 | grad norm: 0.096 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:54:44.477887 | finish at 2025-09-10 12:43:15 + [2025-09-09 19:48:36] iteration 1749/ 11920 | consumed samples: 1790976 | elapsed time per iteration (ms): 6017.3 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198139E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:00:02.209382 | finish at 2025-09-10 12:48:38 + [2025-09-09 19:48:42] iteration 1750/ 11920 | consumed samples: 1792000 | elapsed time per iteration (ms): 5657.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.214938E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:53.545568 | finish at 2025-09-10 11:47:35 + [2025-09-09 19:48:47] iteration 1751/ 11920 | consumed samples: 1793024 | elapsed time per iteration (ms): 5640.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212843E+00 | loss scale: 1.0 | grad norm: 0.102 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:57.025686 | finish at 2025-09-10 11:44:44 + [2025-09-09 19:48:53] iteration 1752/ 11920 | consumed samples: 1794048 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.206670E+00 | loss scale: 1.0 | grad norm: 0.098 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:41.889557 | finish at 2025-09-10 11:44:35 + [2025-09-09 19:48:59] iteration 1753/ 11920 | consumed samples: 1795072 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206342E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:56:03.042601 | finish at 2025-09-10 11:45:02 + [2025-09-09 19:49:04] iteration 1754/ 11920 | consumed samples: 1796096 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209076E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:22.973386 | finish at 2025-09-10 11:44:27 + [2025-09-09 19:49:10] iteration 1755/ 11920 | consumed samples: 1797120 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206662E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:40.794412 | finish at 2025-09-10 11:44:51 + [2025-09-09 19:49:16] iteration 1756/ 11920 | consumed samples: 1798144 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.210587E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:25.637163 | finish at 2025-09-10 11:44:41 + [2025-09-09 19:49:21] iteration 
1757/ 11920 | consumed samples: 1799168 | elapsed time per iteration (ms): 5640.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.201233E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:24.596042 | finish at 2025-09-10 11:44:46 + [2025-09-09 19:49:27] iteration 1758/ 11920 | consumed samples: 1800192 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215598E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:33.303401 | finish at 2025-09-10 11:45:00 + [2025-09-09 19:49:33] iteration 1759/ 11920 | consumed samples: 1801216 | elapsed time per iteration (ms): 5864.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206182E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:05.606827 | finish at 2025-09-10 12:22:38 + [2025-09-09 19:49:39] iteration 1760/ 11920 | consumed samples: 1802240 | elapsed time per iteration (ms): 5901.7 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195827E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:39:21.269779 | finish at 2025-09-10 12:29:00 + [2025-09-09 19:49:44] iteration 1761/ 11920 | consumed samples: 1803264 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.199245E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 
10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:24.794265 | finish at 2025-09-10 11:44:09 + [2025-09-09 19:49:50] iteration 1762/ 11920 | consumed samples: 1804288 | elapsed time per iteration (ms): 5873.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.196404E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:25.886425 | finish at 2025-09-10 12:24:16 + [2025-09-09 19:49:56] iteration 1763/ 11920 | consumed samples: 1805312 | elapsed time per iteration (ms): 5643.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195922E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:22.154041 | finish at 2025-09-10 11:45:18 + [2025-09-09 19:50:01] iteration 1764/ 11920 | consumed samples: 1806336 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213116E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:22.825387 | finish at 2025-09-10 11:45:24 + [2025-09-09 19:50:07] iteration 1765/ 11920 | consumed samples: 1807360 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.196229E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:33.193871 | finish at 2025-09-10 11:44:40 + [2025-09-09 19:50:13] iteration 1766/ 11920 | consumed samples: 1808384 | elapsed time per iteration (ms): 5634.4 | throughput per 
GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.187980E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:31.589972 | finish at 2025-09-10 11:43:44 + [2025-09-09 19:50:18] iteration 1767/ 11920 | consumed samples: 1809408 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198723E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:25.639471 | finish at 2025-09-10 11:44:44 + [2025-09-09 19:50:24] iteration 1768/ 11920 | consumed samples: 1810432 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195362E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:53.887653 | finish at 2025-09-10 11:44:18 + [2025-09-09 19:50:30] iteration 1769/ 11920 | consumed samples: 1811456 | elapsed time per iteration (ms): 5648.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203154E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:39.271196 | finish at 2025-09-10 11:46:09 + [2025-09-09 19:50:35] iteration 1770/ 11920 | consumed samples: 1812480 | elapsed time per iteration (ms): 5648.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.187435E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:29.428792 
| finish at 2025-09-10 11:46:05 + [2025-09-09 19:50:41] iteration 1771/ 11920 | consumed samples: 1813504 | elapsed time per iteration (ms): 5681.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176018E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:04.354777 | finish at 2025-09-10 11:51:45 + [2025-09-09 19:50:47] iteration 1772/ 11920 | consumed samples: 1814528 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.204410E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:53.961350 | finish at 2025-09-10 11:44:41 + [2025-09-09 19:50:52] iteration 1773/ 11920 | consumed samples: 1815552 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192263E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:44.661125 | finish at 2025-09-10 11:44:37 + [2025-09-09 19:50:58] iteration 1774/ 11920 | consumed samples: 1816576 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195658E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:15.828238 | finish at 2025-09-10 11:44:14 + [2025-09-09 19:51:04] iteration 1775/ 11920 | consumed samples: 1817600 | elapsed time per iteration (ms): 5642.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.186196E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:07.334081 | finish at 2025-09-10 11:45:11 + [2025-09-09 19:51:09] iteration 1776/ 11920 | consumed samples: 1818624 | elapsed time per iteration (ms): 5640.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190721E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:37.941322 | finish at 2025-09-10 11:44:47 + [2025-09-09 19:51:15] iteration 1777/ 11920 | consumed samples: 1819648 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.188939E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:43.562681 | finish at 2025-09-10 11:44:58 + [2025-09-09 19:51:20] iteration 1778/ 11920 | consumed samples: 1820672 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198936E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:40.175758 | finish at 2025-09-10 11:44:01 + [2025-09-09 19:51:26] iteration 1779/ 11920 | consumed samples: 1821696 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.197053E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:57.995071 | finish at 2025-09-10 11:45:24 + [2025-09-09 19:51:32] iteration 1780/ 11920 | consumed samples: 
1822720 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.178882E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:12.618184 | finish at 2025-09-10 11:44:44 + [2025-09-09 19:51:37] iteration 1781/ 11920 | consumed samples: 1823744 | elapsed time per iteration (ms): 5656.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.191186E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:53.082023 | finish at 2025-09-10 11:47:31 + [2025-09-09 19:51:43] iteration 1782/ 11920 | consumed samples: 1824768 | elapsed time per iteration (ms): 5643.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192643E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:37.424706 | finish at 2025-09-10 11:45:20 + [2025-09-09 19:51:49] iteration 1783/ 11920 | consumed samples: 1825792 | elapsed time per iteration (ms): 5652.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.191893E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:02.470696 | finish at 2025-09-10 11:46:51 + [2025-09-09 19:51:54] iteration 1784/ 11920 | consumed samples: 1826816 | elapsed time per iteration (ms): 5640.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.180135E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 1.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:56.173435 | finish at 2025-09-10 11:44:51 + [2025-09-09 19:52:00] iteration 1785/ 11920 | consumed samples: 1827840 | elapsed time per iteration (ms): 5640.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176414E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:49.046465 | finish at 2025-09-10 11:44:49 + [2025-09-09 19:52:06] iteration 1786/ 11920 | consumed samples: 1828864 | elapsed time per iteration (ms): 5962.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.188342E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:08.378162 | finish at 2025-09-10 12:39:14 + [2025-09-09 19:52:12] iteration 1787/ 11920 | consumed samples: 1829888 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181551E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:33.225489 | finish at 2025-09-10 11:44:45 + [2025-09-09 19:52:17] iteration 1788/ 11920 | consumed samples: 1830912 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176837E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:13.166125 | finish at 2025-09-10 11:44:30 + [2025-09-09 19:52:23] iteration 1789/ 11920 | consumed samples: 1831936 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 
| MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.173599E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:20.205774 | finish at 2025-09-10 11:44:43 + [2025-09-09 19:52:29] iteration 1790/ 11920 | consumed samples: 1832960 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.177006E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:51.616919 | finish at 2025-09-10 11:45:20 + [2025-09-09 19:52:34] iteration 1791/ 11920 | consumed samples: 1833984 | elapsed time per iteration (ms): 5656.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.189552E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:49.977973 | finish at 2025-09-10 11:47:24 + [2025-09-09 19:52:40] iteration 1792/ 11920 | consumed samples: 1835008 | elapsed time per iteration (ms): 5646.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.178566E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:06.811386 | finish at 2025-09-10 11:45:47 + [2025-09-09 19:52:45] iteration 1793/ 11920 | consumed samples: 1836032 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192171E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:05.936508 | finish at 2025-09-10 
11:44:51 + [2025-09-09 19:52:51] iteration 1794/ 11920 | consumed samples: 1837056 | elapsed time per iteration (ms): 5659.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.182815E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:06.009930 | finish at 2025-09-10 11:47:57 + [2025-09-09 19:52:57] iteration 1795/ 11920 | consumed samples: 1838080 | elapsed time per iteration (ms): 5656.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.169493E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:32.358030 | finish at 2025-09-10 11:47:29 + [2025-09-09 19:53:02] iteration 1796/ 11920 | consumed samples: 1839104 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.182757E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:13.618422 | finish at 2025-09-10 11:44:16 + [2025-09-09 19:53:08] iteration 1797/ 11920 | consumed samples: 1840128 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.166689E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:12.243115 | finish at 2025-09-10 11:45:20 + [2025-09-09 19:53:14] iteration 1798/ 11920 | consumed samples: 1841152 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.191167E+00 | loss 
scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:53.413187 | finish at 2025-09-10 11:45:07 + [2025-09-09 19:53:19] iteration 1799/ 11920 | consumed samples: 1842176 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.174584E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:08.020869 | finish at 2025-09-10 11:45:27 + [2025-09-09 19:53:25] iteration 1800/ 11920 | consumed samples: 1843200 | elapsed time per iteration (ms): 5643.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.173977E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:47.356710 | finish at 2025-09-10 11:45:12 + [2025-09-09 19:53:31] iteration 1801/ 11920 | consumed samples: 1844224 | elapsed time per iteration (ms): 5648.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.184570E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:38.580086 | finish at 2025-09-10 11:46:09 + [2025-09-09 19:53:37] iteration 1802/ 11920 | consumed samples: 1845248 | elapsed time per iteration (ms): 5990.4 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.184668E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:50:11.188807 | finish at 2025-09-10 12:43:48 + [2025-09-09 19:53:42] iteration 1803/ 11920 | consumed samples: 1846272 | elapsed time 
per iteration (ms): 5848.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.183065E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:13.676682 | finish at 2025-09-10 12:19:56 + [2025-09-09 19:53:48] iteration 1804/ 11920 | consumed samples: 1847296 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176361E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:44.140265 | finish at 2025-09-10 11:44:32 + [2025-09-09 19:53:54] iteration 1805/ 11920 | consumed samples: 1848320 | elapsed time per iteration (ms): 5980.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.184079E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:10.229965 | finish at 2025-09-10 12:42:04 + [2025-09-09 19:54:00] iteration 1806/ 11920 | consumed samples: 1849344 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190254E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:12.216554 | finish at 2025-09-10 11:43:12 + [2025-09-09 19:54:05] iteration 1807/ 11920 | consumed samples: 1850368 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.172714E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:51:33.232687 | finish at 2025-09-10 11:45:39 + [2025-09-09 19:54:11] iteration 1808/ 11920 | consumed samples: 1851392 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.187161E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:42.797668 | finish at 2025-09-10 11:44:54 + [2025-09-09 19:54:17] iteration 1809/ 11920 | consumed samples: 1852416 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167800E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:45.542136 | finish at 2025-09-10 11:44:02 + [2025-09-09 19:54:22] iteration 1810/ 11920 | consumed samples: 1853440 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.173034E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:10.274920 | finish at 2025-09-10 11:44:33 + [2025-09-09 19:54:28] iteration 1811/ 11920 | consumed samples: 1854464 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167604E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:14.987617 | finish at 2025-09-10 11:44:43 + [2025-09-09 19:54:34] iteration 1812/ 11920 | consumed samples: 1855488 | elapsed time per iteration (ms): 5649.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.161576E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:47.627153 | finish at 2025-09-10 11:46:21 + [2025-09-09 19:54:39] iteration 1813/ 11920 | consumed samples: 1856512 | elapsed time per iteration (ms): 5651.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164299E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:59.144086 | finish at 2025-09-10 11:46:38 + [2025-09-09 19:54:45] iteration 1814/ 11920 | consumed samples: 1857536 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.173567E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:27.055413 | finish at 2025-09-10 11:44:12 + [2025-09-09 19:54:51] iteration 1815/ 11920 | consumed samples: 1858560 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160707E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:52.986466 | finish at 2025-09-10 11:44:44 + [2025-09-09 19:54:56] iteration 1816/ 11920 | consumed samples: 1859584 | elapsed time per iteration (ms): 5960.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.183636E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:42.273457 | finish at 2025-09-10 12:38:39 + [2025-09-09 
19:55:02] iteration 1817/ 11920 | consumed samples: 1860608 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167594E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:06.405560 | finish at 2025-09-10 11:45:09 + [2025-09-09 19:55:08] iteration 1818/ 11920 | consumed samples: 1861632 | elapsed time per iteration (ms): 5895.2 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.172122E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:33.575478 | finish at 2025-09-10 12:27:42 + [2025-09-09 19:55:14] iteration 1819/ 11920 | consumed samples: 1862656 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160083E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:28.446560 | finish at 2025-09-10 11:44:42 + [2025-09-09 19:55:19] iteration 1820/ 11920 | consumed samples: 1863680 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160329E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:16.764936 | finish at 2025-09-10 11:44:36 + [2025-09-09 19:55:25] iteration 1821/ 11920 | consumed samples: 1864704 | elapsed time per iteration (ms): 5861.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.168496E+00 | loss scale: 1.0 | grad norm: 
0.131 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:39.504789 | finish at 2025-09-10 12:22:05 + [2025-09-09 19:55:31] iteration 1822/ 11920 | consumed samples: 1865728 | elapsed time per iteration (ms): 6082.7 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163781E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:43.477913 | finish at 2025-09-10 12:59:15 + [2025-09-09 19:55:37] iteration 1823/ 11920 | consumed samples: 1866752 | elapsed time per iteration (ms): 5927.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.162362E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:31.840485 | finish at 2025-09-10 12:33:09 + [2025-09-09 19:55:43] iteration 1824/ 11920 | consumed samples: 1867776 | elapsed time per iteration (ms): 5903.3 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167824E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:19.938725 | finish at 2025-09-10 12:29:03 + [2025-09-09 19:55:49] iteration 1825/ 11920 | consumed samples: 1868800 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.168149E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:06.237098 | finish at 2025-09-10 11:44:55 + [2025-09-09 19:55:55] iteration 1826/ 11920 | consumed samples: 1869824 | elapsed time per iteration (ms): 
5965.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.158926E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:32.846192 | finish at 2025-09-10 12:39:28 + [2025-09-09 19:56:00] iteration 1827/ 11920 | consumed samples: 1870848 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181913E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:48:24.926080 | finish at 2025-09-10 11:44:25 + [2025-09-09 19:56:06] iteration 1828/ 11920 | consumed samples: 1871872 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.175388E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:48:31.989930 | finish at 2025-09-10 11:44:38 + [2025-09-09 19:56:12] iteration 1829/ 11920 | consumed samples: 1872896 | elapsed time per iteration (ms): 5644.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.177090E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:17.571838 | finish at 2025-09-10 11:45:29 + [2025-09-09 19:56:17] iteration 1830/ 11920 | consumed samples: 1873920 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.178968E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:48:19.152439 | finish at 2025-09-10 11:44:36 + [2025-09-09 19:56:23] iteration 1831/ 11920 | consumed samples: 1874944 | elapsed time per iteration (ms): 5944.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.182753E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:39:36.090354 | finish at 2025-09-10 12:35:59 + [2025-09-09 19:56:29] iteration 1832/ 11920 | consumed samples: 1875968 | elapsed time per iteration (ms): 5852.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.173420E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:01.207767 | finish at 2025-09-10 12:20:30 + [2025-09-09 19:56:35] iteration 1833/ 11920 | consumed samples: 1876992 | elapsed time per iteration (ms): 5653.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.170963E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:26.619625 | finish at 2025-09-10 11:47:01 + [2025-09-09 19:56:41] iteration 1834/ 11920 | consumed samples: 1878016 | elapsed time per iteration (ms): 5907.1 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.173534E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:58.707228 | finish at 2025-09-10 12:29:39 + [2025-09-09 19:56:46] iteration 1835/ 11920 | consumed samples: 1879040 | elapsed time per iteration (ms): 5882.7 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.166653E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:28:46.913748 | finish at 2025-09-10 12:25:33 + [2025-09-09 19:56:52] iteration 1836/ 11920 | consumed samples: 1880064 | elapsed time per iteration (ms): 5973.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161741E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:43:52.941819 | finish at 2025-09-10 12:40:45 + [2025-09-09 19:56:58] iteration 1837/ 11920 | consumed samples: 1881088 | elapsed time per iteration (ms): 5640.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.158477E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:47:52.282338 | finish at 2025-09-10 11:44:50 + [2025-09-09 19:57:04] iteration 1838/ 11920 | consumed samples: 1882112 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.149471E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:47:35.159277 | finish at 2025-09-10 11:44:39 + [2025-09-09 19:57:09] iteration 1839/ 11920 | consumed samples: 1883136 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.177913E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:51.544740 | finish at 2025-09-10 11:44:01 + [2025-09-09 19:57:15] iteration 
1840/ 11920 | consumed samples: 1884160 | elapsed time per iteration (ms): 5858.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163523E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:12.007370 | finish at 2025-09-10 12:21:27 + [2025-09-09 19:57:21] iteration 1841/ 11920 | consumed samples: 1885184 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.170890E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:13.146024 | finish at 2025-09-10 11:43:34 + [2025-09-09 19:57:27] iteration 1842/ 11920 | consumed samples: 1886208 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164579E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:47:04.185235 | finish at 2025-09-10 11:44:31 + [2025-09-09 19:57:33] iteration 1843/ 11920 | consumed samples: 1887232 | elapsed time per iteration (ms): 6007.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.143336E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:48:57.108331 | finish at 2025-09-10 12:46:30 + [2025-09-09 19:57:38] iteration 1844/ 11920 | consumed samples: 1888256 | elapsed time per iteration (ms): 5848.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.154952E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:06.773932 | finish at 2025-09-10 12:19:45 + [2025-09-09 19:57:44] iteration 1845/ 11920 | consumed samples: 1889280 | elapsed time per iteration (ms): 5643.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163032E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:47:37.391453 | finish at 2025-09-10 11:45:21 + [2025-09-09 19:57:50] iteration 1846/ 11920 | consumed samples: 1890304 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155670E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:17.879796 | finish at 2025-09-10 11:44:08 + [2025-09-09 19:57:56] iteration 1847/ 11920 | consumed samples: 1891328 | elapsed time per iteration (ms): 6307.3 | throughput per GPU (TFLOP/s/GPU): 71.6 | MFU 7.24% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145726E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:38:53.522673 | finish at 2025-09-10 13:36:49 + [2025-09-09 19:58:02] iteration 1848/ 11920 | consumed samples: 1892352 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161147E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:45:47.034214 | finish at 2025-09-10 11:43:49 + [2025-09-09 19:58:07] iteration 1849/ 11920 | consumed samples: 1893376 | elapsed time per iteration (ms): 5636.0 | throughput per 
GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.165350E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:00.532149 | finish at 2025-09-10 11:44:08 + [2025-09-09 19:58:13] iteration 1850/ 11920 | consumed samples: 1894400 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160117E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:34.061587 | finish at 2025-09-10 11:44:47 + [2025-09-09 19:58:19] iteration 1851/ 11920 | consumed samples: 1895424 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.166080E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:56.144212 | finish at 2025-09-10 11:45:15 + [2025-09-09 19:58:24] iteration 1852/ 11920 | consumed samples: 1896448 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.172615E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:45:34.720960 | finish at 2025-09-10 11:43:59 + [2025-09-09 19:58:30] iteration 1853/ 11920 | consumed samples: 1897472 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163763E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:27.760114 
| finish at 2025-09-10 11:44:58 + [2025-09-09 19:58:35] iteration 1854/ 11920 | consumed samples: 1898496 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.154589E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:04.566107 | finish at 2025-09-10 11:44:40 + [2025-09-09 19:58:41] iteration 1855/ 11920 | consumed samples: 1899520 | elapsed time per iteration (ms): 5903.5 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.149788E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:18.939478 | finish at 2025-09-10 12:29:00 + [2025-09-09 19:58:47] iteration 1856/ 11920 | consumed samples: 1900544 | elapsed time per iteration (ms): 5909.8 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155441E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:16.525261 | finish at 2025-09-10 12:30:04 + [2025-09-09 19:58:53] iteration 1857/ 11920 | consumed samples: 1901568 | elapsed time per iteration (ms): 5921.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176257E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:06.285959 | finish at 2025-09-10 12:31:59 + [2025-09-09 19:58:59] iteration 1858/ 11920 | consumed samples: 1902592 | elapsed time per iteration (ms): 5831.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.146053E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:17:59.053262 | finish at 2025-09-10 12:16:58 + [2025-09-09 19:59:05] iteration 1859/ 11920 | consumed samples: 1903616 | elapsed time per iteration (ms): 6010.8 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.165553E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:54.173098 | finish at 2025-09-10 12:46:59 + [2025-09-09 19:59:11] iteration 1860/ 11920 | consumed samples: 1904640 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.170467E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:57.685375 | finish at 2025-09-10 11:42:08 + [2025-09-09 19:59:16] iteration 1861/ 11920 | consumed samples: 1905664 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163057E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:39.841735 | finish at 2025-09-10 11:42:56 + [2025-09-09 19:59:22] iteration 1862/ 11920 | consumed samples: 1906688 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.150838E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:02.984334 | finish at 2025-09-10 11:43:25 + [2025-09-09 19:59:28] iteration 1863/ 11920 | consumed samples: 
1907712 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161350E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:58.766926 | finish at 2025-09-10 11:44:26 + [2025-09-09 19:59:33] iteration 1864/ 11920 | consumed samples: 1908736 | elapsed time per iteration (ms): 5643.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160098E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:45:50.576574 | finish at 2025-09-10 11:45:24 + [2025-09-09 19:59:39] iteration 1865/ 11920 | consumed samples: 1909760 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144396E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:49.097633 | finish at 2025-09-10 11:44:28 + [2025-09-09 19:59:44] iteration 1866/ 11920 | consumed samples: 1910784 | elapsed time per iteration (ms): 5638.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164496E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:50.670090 | finish at 2025-09-10 11:44:35 + [2025-09-09 19:59:50] iteration 1867/ 11920 | consumed samples: 1911808 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145796E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:46.297349 | finish at 2025-09-10 11:43:36 + [2025-09-09 19:59:56] iteration 1868/ 11920 | consumed samples: 1912832 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.180494E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:44.205857 | finish at 2025-09-10 11:42:40 + [2025-09-09 20:00:01] iteration 1869/ 11920 | consumed samples: 1913856 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.168230E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:41.719996 | finish at 2025-09-10 11:43:43 + [2025-09-09 20:00:07] iteration 1870/ 11920 | consumed samples: 1914880 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.166524E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:27.122719 | finish at 2025-09-10 11:43:34 + [2025-09-09 20:00:13] iteration 1871/ 11920 | consumed samples: 1915904 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155161E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:43.328509 | finish at 2025-09-10 11:43:56 + [2025-09-09 20:00:18] iteration 1872/ 11920 | consumed samples: 1916928 | elapsed time per iteration (ms): 5881.6 | throughput per GPU (TFLOP/s/GPU): 76.8 
| MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.174130E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:58.744644 | finish at 2025-09-10 12:25:17 + [2025-09-09 20:00:24] iteration 1873/ 11920 | consumed samples: 1917952 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.169781E+00 | loss scale: 1.0 | grad norm: 0.401 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:05.103724 | finish at 2025-09-10 11:43:29 + [2025-09-09 20:00:30] iteration 1874/ 11920 | consumed samples: 1918976 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.192116E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:49.361743 | finish at 2025-09-10 11:43:19 + [2025-09-09 20:00:36] iteration 1875/ 11920 | consumed samples: 1920000 | elapsed time per iteration (ms): 5855.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.197494E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:20.887452 | finish at 2025-09-10 12:20:56 + [2025-09-09 20:00:41] iteration 1876/ 11920 | consumed samples: 1921024 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195378E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:25.668531 | finish at 2025-09-10 
11:45:07 + [2025-09-09 20:00:47] iteration 1877/ 11920 | consumed samples: 1922048 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.208831E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:37.013108 | finish at 2025-09-10 11:44:24 + [2025-09-09 20:00:53] iteration 1878/ 11920 | consumed samples: 1923072 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215657E+00 | loss scale: 1.0 | grad norm: 0.293 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:23.268888 | finish at 2025-09-10 11:44:16 + [2025-09-09 20:00:58] iteration 1879/ 11920 | consumed samples: 1924096 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.217173E+00 | loss scale: 1.0 | grad norm: 0.323 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:33.336776 | finish at 2025-09-10 11:43:31 + [2025-09-09 20:01:04] iteration 1880/ 11920 | consumed samples: 1925120 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213536E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:29.921122 | finish at 2025-09-10 11:43:34 + [2025-09-09 20:01:09] iteration 1881/ 11920 | consumed samples: 1926144 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.207167E+00 | loss 
scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:18.242578 | finish at 2025-09-10 11:44:28 + [2025-09-09 20:01:15] iteration 1882/ 11920 | consumed samples: 1927168 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203331E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:54.041398 | finish at 2025-09-10 11:45:09 + [2025-09-09 20:01:21] iteration 1883/ 11920 | consumed samples: 1928192 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.208316E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:18.915190 | finish at 2025-09-10 11:44:40 + [2025-09-09 20:01:26] iteration 1884/ 11920 | consumed samples: 1929216 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212076E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:47.405546 | finish at 2025-09-10 11:44:14 + [2025-09-09 20:01:32] iteration 1885/ 11920 | consumed samples: 1930240 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190242E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:41.436535 | finish at 2025-09-10 11:44:13 + [2025-09-09 20:01:38] iteration 1886/ 11920 | consumed samples: 1931264 | elapsed time 
per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211471E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:34.070492 | finish at 2025-09-10 11:44:12 + [2025-09-09 20:01:43] iteration 1887/ 11920 | consumed samples: 1932288 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203164E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:19.466439 | finish at 2025-09-10 11:44:03 + [2025-09-09 20:01:49] iteration 1888/ 11920 | consumed samples: 1933312 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.194129E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:11.209660 | finish at 2025-09-10 11:44:00 + [2025-09-09 20:01:55] iteration 1889/ 11920 | consumed samples: 1934336 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.199011E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:48.909942 | finish at 2025-09-10 11:44:43 + [2025-09-09 20:02:00] iteration 1890/ 11920 | consumed samples: 1935360 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181382E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:42:09.686577 | finish at 2025-09-10 11:44:10 + [2025-09-09 20:02:06] iteration 1891/ 11920 | consumed samples: 1936384 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.189108E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:04.679376 | finish at 2025-09-10 11:44:10 + [2025-09-09 20:02:11] iteration 1892/ 11920 | consumed samples: 1937408 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193890E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:41:07.056360 | finish at 2025-09-10 11:43:18 + [2025-09-09 20:02:17] iteration 1893/ 11920 | consumed samples: 1938432 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.170759E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:25.764213 | finish at 2025-09-10 11:44:43 + [2025-09-09 20:02:23] iteration 1894/ 11920 | consumed samples: 1939456 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.186160E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:09.882065 | finish at 2025-09-10 11:44:33 + [2025-09-09 20:02:28] iteration 1895/ 11920 | consumed samples: 1940480 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.180112E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:41:52.197399 | finish at 2025-09-10 11:44:21 + [2025-09-09 20:02:34] iteration 1896/ 11920 | consumed samples: 1941504 | elapsed time per iteration (ms): 5643.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.190955E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:46.214762 | finish at 2025-09-10 11:45:20 + [2025-09-09 20:02:40] iteration 1897/ 11920 | consumed samples: 1942528 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.162027E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:54.833231 | finish at 2025-09-10 11:45:34 + [2025-09-09 20:02:46] iteration 1898/ 11920 | consumed samples: 1943552 | elapsed time per iteration (ms): 5977.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.174040E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:38:28.788914 | finish at 2025-09-10 12:41:14 + [2025-09-09 20:02:51] iteration 1899/ 11920 | consumed samples: 1944576 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163966E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:17.537868 | finish at 2025-09-10 11:45:09 + [2025-09-09 
20:02:57] iteration 1900/ 11920 | consumed samples: 1945600 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193845E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:43.241873 | finish at 2025-09-10 11:43:40 + [2025-09-09 20:03:03] iteration 1901/ 11920 | consumed samples: 1946624 | elapsed time per iteration (ms): 6031.6 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.180792E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:47:10.829701 | finish at 2025-09-10 12:50:14 + [2025-09-09 20:03:09] iteration 1902/ 11920 | consumed samples: 1947648 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164463E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:13.410122 | finish at 2025-09-10 11:43:22 + [2025-09-09 20:03:14] iteration 1903/ 11920 | consumed samples: 1948672 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167835E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:50.480629 | finish at 2025-09-10 11:44:05 + [2025-09-09 20:03:20] iteration 1904/ 11920 | consumed samples: 1949696 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.178459E+00 | loss scale: 1.0 | grad norm: 
0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:07.253258 | finish at 2025-09-10 11:43:27 + [2025-09-09 20:03:25] iteration 1905/ 11920 | consumed samples: 1950720 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.158560E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:29.792675 | finish at 2025-09-10 11:42:55 + [2025-09-09 20:03:31] iteration 1906/ 11920 | consumed samples: 1951744 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.162495E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:32.857112 | finish at 2025-09-10 11:43:04 + [2025-09-09 20:03:37] iteration 1907/ 11920 | consumed samples: 1952768 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176321E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:41:16.560596 | finish at 2025-09-10 11:44:53 + [2025-09-09 20:03:42] iteration 1908/ 11920 | consumed samples: 1953792 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161247E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:04.421923 | finish at 2025-09-10 11:43:47 + [2025-09-09 20:03:48] iteration 1909/ 11920 | consumed samples: 1954816 | elapsed time per iteration (ms): 
5837.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164528E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:58.187127 | finish at 2025-09-10 12:17:46 + [2025-09-09 20:03:54] iteration 1910/ 11920 | consumed samples: 1955840 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160830E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:25.880082 | finish at 2025-09-10 11:42:20 + [2025-09-09 20:03:59] iteration 1911/ 11920 | consumed samples: 1956864 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.158326E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:00.722529 | finish at 2025-09-10 11:43:00 + [2025-09-09 20:04:05] iteration 1912/ 11920 | consumed samples: 1957888 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.172814E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:07.490965 | finish at 2025-09-10 11:42:13 + [2025-09-09 20:04:11] iteration 1913/ 11920 | consumed samples: 1958912 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155111E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:37:44.096869 | finish at 2025-09-10 11:41:55 + [2025-09-09 20:04:16] iteration 1914/ 11920 | consumed samples: 1959936 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163961E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:25.922554 | finish at 2025-09-10 11:43:42 + [2025-09-09 20:04:22] iteration 1915/ 11920 | consumed samples: 1960960 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155880E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:21.525557 | finish at 2025-09-10 11:42:43 + [2025-09-09 20:04:28] iteration 1916/ 11920 | consumed samples: 1961984 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167417E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:59.169987 | finish at 2025-09-10 11:44:27 + [2025-09-09 20:04:33] iteration 1917/ 11920 | consumed samples: 1963008 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.157519E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:40.944817 | finish at 2025-09-10 11:44:14 + [2025-09-09 20:04:39] iteration 1918/ 11920 | consumed samples: 1964032 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.166543E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:07.589592 | finish at 2025-09-10 11:44:46 + [2025-09-09 20:04:44] iteration 1919/ 11920 | consumed samples: 1965056 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161322E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:12.375397 | finish at 2025-09-10 11:43:57 + [2025-09-09 20:04:50] iteration 1920/ 11920 | consumed samples: 1966080 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167336E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:59.218197 | finish at 2025-09-10 11:42:49 + [2025-09-09 20:04:56] iteration 1921/ 11920 | consumed samples: 1967104 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163132E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:01.355604 | finish at 2025-09-10 11:41:57 + [2025-09-09 20:05:01] iteration 1922/ 11920 | consumed samples: 1968128 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.159290E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:14.462046 | finish at 2025-09-10 11:43:16 + [2025-09-09 20:05:07] iteration 
1923/ 11920 | consumed samples: 1969152 | elapsed time per iteration (ms): 5862.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.179114E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:16:48.298674 | finish at 2025-09-10 12:21:56 + [2025-09-09 20:05:13] iteration 1924/ 11920 | consumed samples: 1970176 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155881E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:26.918575 | finish at 2025-09-10 11:42:40 + [2025-09-09 20:05:18] iteration 1925/ 11920 | consumed samples: 1971200 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.160511E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:43.107940 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:05:24] iteration 1926/ 11920 | consumed samples: 1972224 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.169582E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:22.317343 | finish at 2025-09-10 11:42:46 + [2025-09-09 20:05:30] iteration 1927/ 11920 | consumed samples: 1973248 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155834E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:52.272624 | finish at 2025-09-10 11:43:22 + [2025-09-09 20:05:35] iteration 1928/ 11920 | consumed samples: 1974272 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.147094E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:58.414740 | finish at 2025-09-10 11:44:34 + [2025-09-09 20:05:41] iteration 1929/ 11920 | consumed samples: 1975296 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.138961E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:29.257300 | finish at 2025-09-10 11:43:10 + [2025-09-09 20:05:47] iteration 1930/ 11920 | consumed samples: 1976320 | elapsed time per iteration (ms): 5844.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144432E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:09.840152 | finish at 2025-09-10 12:18:57 + [2025-09-09 20:05:52] iteration 1931/ 11920 | consumed samples: 1977344 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.163225E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:00.124962 | finish at 2025-09-10 11:41:53 + [2025-09-09 20:05:58] iteration 1932/ 11920 | consumed samples: 1978368 | elapsed time per iteration (ms): 5627.6 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.141529E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:48.806495 | finish at 2025-09-10 11:42:47 + [2025-09-09 20:06:04] iteration 1933/ 11920 | consumed samples: 1979392 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145104E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:50.530661 | finish at 2025-09-10 11:41:54 + [2025-09-09 20:06:09] iteration 1934/ 11920 | consumed samples: 1980416 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.159815E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:41.588894 | finish at 2025-09-10 11:43:51 + [2025-09-09 20:06:15] iteration 1935/ 11920 | consumed samples: 1981440 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125436E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:57.312794 | finish at 2025-09-10 11:43:12 + [2025-09-09 20:06:21] iteration 1936/ 11920 | consumed samples: 1982464 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.158309E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:33.343872 
| finish at 2025-09-10 11:43:54 + [2025-09-09 20:06:26] iteration 1937/ 11920 | consumed samples: 1983488 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.146616E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:18.633312 | finish at 2025-09-10 11:42:45 + [2025-09-09 20:06:32] iteration 1938/ 11920 | consumed samples: 1984512 | elapsed time per iteration (ms): 5960.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.154540E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:33.472427 | finish at 2025-09-10 12:38:06 + [2025-09-09 20:06:38] iteration 1939/ 11920 | consumed samples: 1985536 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.153395E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:50.745943 | finish at 2025-09-10 11:44:29 + [2025-09-09 20:06:44] iteration 1940/ 11920 | consumed samples: 1986560 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.140796E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:03.974557 | finish at 2025-09-10 11:44:47 + [2025-09-09 20:06:49] iteration 1941/ 11920 | consumed samples: 1987584 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.151646E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:18.267127 | finish at 2025-09-10 11:44:07 + [2025-09-09 20:06:55] iteration 1942/ 11920 | consumed samples: 1988608 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.153447E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:53.383457 | finish at 2025-09-10 11:43:48 + [2025-09-09 20:07:00] iteration 1943/ 11920 | consumed samples: 1989632 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.148359E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:16.433727 | finish at 2025-09-10 11:42:17 + [2025-09-09 20:07:06] iteration 1944/ 11920 | consumed samples: 1990656 | elapsed time per iteration (ms): 5841.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.136222E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:16.764292 | finish at 2025-09-10 12:18:23 + [2025-09-09 20:07:12] iteration 1945/ 11920 | consumed samples: 1991680 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.142634E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:58.383089 | finish at 2025-09-10 11:43:10 + [2025-09-09 20:07:18] iteration 1946/ 11920 | consumed samples: 
1992704 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144348E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:42.743217 | finish at 2025-09-10 11:44:00 + [2025-09-09 20:07:23] iteration 1947/ 11920 | consumed samples: 1993728 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.146588E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:39.343270 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:07:29] iteration 1948/ 11920 | consumed samples: 1994752 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131423E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:14.598956 | finish at 2025-09-10 11:42:43 + [2025-09-09 20:07:34] iteration 1949/ 11920 | consumed samples: 1995776 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.149807E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:44.421615 | finish at 2025-09-10 11:43:19 + [2025-09-09 20:07:40] iteration 1950/ 11920 | consumed samples: 1996800 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.152126E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:20.310483 | finish at 2025-09-10 11:44:00 + [2025-09-09 20:07:46] iteration 1951/ 11920 | consumed samples: 1997824 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145878E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:16.992922 | finish at 2025-09-10 11:44:03 + [2025-09-09 20:07:51] iteration 1952/ 11920 | consumed samples: 1998848 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.142709E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:26.097694 | finish at 2025-09-10 11:42:17 + [2025-09-09 20:07:57] iteration 1953/ 11920 | consumed samples: 1999872 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.120420E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:02.493517 | finish at 2025-09-10 11:42:59 + [2025-09-09 20:08:03] iteration 1954/ 11920 | consumed samples: 2000896 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.141110E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:58.278460 | finish at 2025-09-10 11:43:01 + [2025-09-09 20:08:09] iteration 1955/ 11920 | consumed samples: 2001920 | elapsed time per iteration (ms): 6004.3 | throughput per GPU (TFLOP/s/GPU): 75.2 
| MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145409E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:37:12.843543 | finish at 2025-09-10 12:45:21 + [2025-09-09 20:08:14] iteration 1956/ 11920 | consumed samples: 2002944 | elapsed time per iteration (ms): 5830.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.153330E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:08:12.965212 | finish at 2025-09-10 12:16:27 + [2025-09-09 20:08:20] iteration 1957/ 11920 | consumed samples: 2003968 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145967E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:40.662324 | finish at 2025-09-10 11:43:01 + [2025-09-09 20:08:26] iteration 1958/ 11920 | consumed samples: 2004992 | elapsed time per iteration (ms): 5973.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.152625E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:43.299634 | finish at 2025-09-10 12:40:09 + [2025-09-09 20:08:32] iteration 1959/ 11920 | consumed samples: 2006016 | elapsed time per iteration (ms): 5856.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.135155E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:18.934871 | finish at 2025-09-10 
12:20:51 + [2025-09-09 20:08:37] iteration 1960/ 11920 | consumed samples: 2007040 | elapsed time per iteration (ms): 5643.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.152873E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:36:44.058037 | finish at 2025-09-10 11:45:22 + [2025-09-09 20:08:43] iteration 1961/ 11920 | consumed samples: 2008064 | elapsed time per iteration (ms): 5843.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.148368E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:50.467221 | finish at 2025-09-10 12:18:34 + [2025-09-09 20:08:49] iteration 1962/ 11920 | consumed samples: 2009088 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.151338E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:11.931437 | finish at 2025-09-10 11:43:01 + [2025-09-09 20:08:55] iteration 1963/ 11920 | consumed samples: 2010112 | elapsed time per iteration (ms): 5843.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145019E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:42.173567 | finish at 2025-09-10 12:18:37 + [2025-09-09 20:09:01] iteration 1964/ 11920 | consumed samples: 2011136 | elapsed time per iteration (ms): 5947.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.138969E+00 | loss 
scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:26:57.238371 | finish at 2025-09-10 12:35:58 + [2025-09-09 20:09:06] iteration 1965/ 11920 | consumed samples: 2012160 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.130458E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:06.550072 | finish at 2025-09-10 11:44:13 + [2025-09-09 20:09:12] iteration 1966/ 11920 | consumed samples: 2013184 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137697E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:05.226482 | finish at 2025-09-10 11:43:17 + [2025-09-09 20:09:18] iteration 1967/ 11920 | consumed samples: 2014208 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.132221E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:06.126501 | finish at 2025-09-10 11:43:24 + [2025-09-09 20:09:23] iteration 1968/ 11920 | consumed samples: 2015232 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.141518E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:43.209999 | finish at 2025-09-10 11:43:06 + [2025-09-09 20:09:29] iteration 1969/ 11920 | consumed samples: 2016256 | elapsed time 
per iteration (ms): 5642.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137262E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:49.083769 | finish at 2025-09-10 11:45:18 + [2025-09-09 20:09:35] iteration 1970/ 11920 | consumed samples: 2017280 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.135409E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:31.482284 | finish at 2025-09-10 11:42:06 + [2025-09-09 20:09:40] iteration 1971/ 11920 | consumed samples: 2018304 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.136025E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:22.652858 | finish at 2025-09-10 11:44:03 + [2025-09-09 20:09:46] iteration 1972/ 11920 | consumed samples: 2019328 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119255E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:18.171421 | finish at 2025-09-10 11:43:04 + [2025-09-09 20:09:51] iteration 1973/ 11920 | consumed samples: 2020352 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.133512E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:33:59.413639 | finish at 2025-09-10 11:43:51 + [2025-09-09 20:09:57] iteration 1974/ 11920 | consumed samples: 2021376 | elapsed time per iteration (ms): 5924.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.136459E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:02.831254 | finish at 2025-09-10 12:32:00 + [2025-09-09 20:10:03] iteration 1975/ 11920 | consumed samples: 2022400 | elapsed time per iteration (ms): 5640.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.139216E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:56.297783 | finish at 2025-09-10 11:44:59 + [2025-09-09 20:10:09] iteration 1976/ 11920 | consumed samples: 2023424 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144330E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:04.760147 | finish at 2025-09-10 11:44:13 + [2025-09-09 20:10:15] iteration 1977/ 11920 | consumed samples: 2024448 | elapsed time per iteration (ms): 5863.5 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137721E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:40.724782 | finish at 2025-09-10 12:21:55 + [2025-09-09 20:10:20] iteration 1978/ 11920 | consumed samples: 2025472 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.138695E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:45.298486 | finish at 2025-09-10 11:44:05 + [2025-09-09 20:10:26] iteration 1979/ 11920 | consumed samples: 2026496 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.148814E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:47.567184 | finish at 2025-09-10 11:45:13 + [2025-09-09 20:10:32] iteration 1980/ 11920 | consumed samples: 2027520 | elapsed time per iteration (ms): 5993.0 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.138517E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:50.901389 | finish at 2025-09-10 12:43:23 + [2025-09-09 20:10:38] iteration 1981/ 11920 | consumed samples: 2028544 | elapsed time per iteration (ms): 6004.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.126142E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:34:39.606135 | finish at 2025-09-10 12:45:17 + [2025-09-09 20:10:44] iteration 1982/ 11920 | consumed samples: 2029568 | elapsed time per iteration (ms): 5909.2 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.134396E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:45.866074 | finish at 2025-09-10 12:29:30 + [2025-09-09 
20:10:50] iteration 1983/ 11920 | consumed samples: 2030592 | elapsed time per iteration (ms): 6138.1 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.127646E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:56:34.249042 | finish at 2025-09-10 13:07:24 + [2025-09-09 20:10:56] iteration 1984/ 11920 | consumed samples: 2031616 | elapsed time per iteration (ms): 6201.5 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.140831E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:06:57.705414 | finish at 2025-09-10 13:17:54 + [2025-09-09 20:11:02] iteration 1985/ 11920 | consumed samples: 2032640 | elapsed time per iteration (ms): 5890.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137973E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:20.355878 | finish at 2025-09-10 12:26:22 + [2025-09-09 20:11:08] iteration 1986/ 11920 | consumed samples: 2033664 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.122294E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:03.236547 | finish at 2025-09-10 11:44:11 + [2025-09-09 20:11:13] iteration 1987/ 11920 | consumed samples: 2034688 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.134973E+00 | loss scale: 1.0 | grad norm: 
0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:02.338808 | finish at 2025-09-10 11:43:16 + [2025-09-09 20:11:19] iteration 1988/ 11920 | consumed samples: 2035712 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.120792E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:32.496035 | finish at 2025-09-10 11:43:51 + [2025-09-09 20:11:24] iteration 1989/ 11920 | consumed samples: 2036736 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117285E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:08.024778 | finish at 2025-09-10 11:43:32 + [2025-09-09 20:11:30] iteration 1990/ 11920 | consumed samples: 2037760 | elapsed time per iteration (ms): 5988.0 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.141512E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:31:01.104326 | finish at 2025-09-10 12:42:32 + [2025-09-09 20:11:36] iteration 1991/ 11920 | consumed samples: 2038784 | elapsed time per iteration (ms): 5643.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.153353E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:50.032387 | finish at 2025-09-10 11:45:26 + [2025-09-09 20:11:42] iteration 1992/ 11920 | consumed samples: 2039808 | elapsed time per iteration (ms): 
5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.135909E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:36.148142 | finish at 2025-09-10 11:44:18 + [2025-09-09 20:11:47] iteration 1993/ 11920 | consumed samples: 2040832 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128136E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:31:09.291115 | finish at 2025-09-10 11:42:57 + [2025-09-09 20:11:53] iteration 1994/ 11920 | consumed samples: 2041856 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144420E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:31:31.635637 | finish at 2025-09-10 11:43:25 + [2025-09-09 20:11:59] iteration 1995/ 11920 | consumed samples: 2042880 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131378E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:11.593997 | finish at 2025-09-10 11:42:10 + [2025-09-09 20:12:04] iteration 1996/ 11920 | consumed samples: 2043904 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.129641E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:32:48.976047 | finish at 2025-09-10 11:44:53 + [2025-09-09 20:12:10] iteration 1997/ 11920 | consumed samples: 2044928 | elapsed time per iteration (ms): 5645.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.138433E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:35.843464 | finish at 2025-09-10 11:45:46 + [2025-09-09 20:12:16] iteration 1998/ 11920 | consumed samples: 2045952 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137069E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:31:52.485387 | finish at 2025-09-10 11:44:08 + [2025-09-09 20:12:21] iteration 1999/ 11920 | consumed samples: 2046976 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.138292E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:13.863517 | finish at 2025-09-10 11:42:35 + [2025-09-09 20:12:27] iteration 2000/ 11920 | consumed samples: 2048000 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144872E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:32.577057 | finish at 2025-09-10 11:42:59 + [2025-09-09 20:12:32] iteration 2001/ 11920 | consumed samples: 2049024 | elapsed time per iteration (ms): 5649.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.119509E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:59.877288 | finish at 2025-09-10 11:46:32 + [2025-09-09 20:12:38] iteration 2002/ 11920 | consumed samples: 2050048 | elapsed time per iteration (ms): 5670.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128894E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:37:19.132659 | finish at 2025-09-10 11:49:57 + [2025-09-09 20:12:44] iteration 2003/ 11920 | consumed samples: 2051072 | elapsed time per iteration (ms): 5648.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131035E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:31.145087 | finish at 2025-09-10 11:46:15 + [2025-09-09 20:12:49] iteration 2004/ 11920 | consumed samples: 2052096 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131501E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:02.668797 | finish at 2025-09-10 11:44:52 + [2025-09-09 20:12:55] iteration 2005/ 11920 | consumed samples: 2053120 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.139482E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:31:50.594566 | finish at 2025-09-10 11:44:46 + [2025-09-09 20:13:01] iteration 
2006/ 11920 | consumed samples: 2054144 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.153252E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:26.668372 | finish at 2025-09-10 11:42:27 + [2025-09-09 20:13:07] iteration 2007/ 11920 | consumed samples: 2055168 | elapsed time per iteration (ms): 5905.8 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.129371E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:44.368964 | finish at 2025-09-10 12:28:51 + [2025-09-09 20:13:13] iteration 2008/ 11920 | consumed samples: 2056192 | elapsed time per iteration (ms): 6005.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.134070E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:32:09.711828 | finish at 2025-09-10 12:45:22 + [2025-09-09 20:13:19] iteration 2009/ 11920 | consumed samples: 2057216 | elapsed time per iteration (ms): 5965.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.129058E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:25:26.900124 | finish at 2025-09-10 12:38:45 + [2025-09-09 20:13:25] iteration 2010/ 11920 | consumed samples: 2058240 | elapsed time per iteration (ms): 6221.8 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.127212E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:07:38.324771 | finish at 2025-09-10 13:21:03 + [2025-09-09 20:13:30] iteration 2011/ 11920 | consumed samples: 2059264 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118859E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:11.822715 | finish at 2025-09-10 11:42:42 + [2025-09-09 20:13:36] iteration 2012/ 11920 | consumed samples: 2060288 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.129091E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:13.242929 | finish at 2025-09-10 11:42:49 + [2025-09-09 20:13:42] iteration 2013/ 11920 | consumed samples: 2061312 | elapsed time per iteration (ms): 5992.3 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.135608E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:29:25.983340 | finish at 2025-09-10 12:43:08 + [2025-09-09 20:13:48] iteration 2014/ 11920 | consumed samples: 2062336 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.129699E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:57.005397 | finish at 2025-09-10 11:42:45 + [2025-09-09 20:13:53] iteration 2015/ 11920 | consumed samples: 2063360 | elapsed time per iteration (ms): 5630.1 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128220E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:26.088663 | finish at 2025-09-10 11:43:19 + [2025-09-09 20:13:59] iteration 2016/ 11920 | consumed samples: 2064384 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.123120E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:27.846497 | finish at 2025-09-10 11:42:27 + [2025-09-09 20:14:05] iteration 2017/ 11920 | consumed samples: 2065408 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.134407E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:03.692955 | finish at 2025-09-10 11:44:08 + [2025-09-09 20:14:10] iteration 2018/ 11920 | consumed samples: 2066432 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.110833E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:14.342607 | finish at 2025-09-10 11:43:24 + [2025-09-09 20:14:16] iteration 2019/ 11920 | consumed samples: 2067456 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.122166E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:10.183918 
| finish at 2025-09-10 11:44:26 + [2025-09-09 20:14:22] iteration 2020/ 11920 | consumed samples: 2068480 | elapsed time per iteration (ms): 5879.9 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.114071E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:10:11.335516 | finish at 2025-09-10 12:24:33 + [2025-09-09 20:14:27] iteration 2021/ 11920 | consumed samples: 2069504 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125446E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:05.139026 | finish at 2025-09-10 11:42:32 + [2025-09-09 20:14:33] iteration 2022/ 11920 | consumed samples: 2070528 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118383E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:55.159362 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:14:39] iteration 2023/ 11920 | consumed samples: 2071552 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.127796E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:52.516926 | finish at 2025-09-10 12:33:31 + [2025-09-09 20:14:44] iteration 2024/ 11920 | consumed samples: 2072576 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.109207E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:33.192200 | finish at 2025-09-10 11:44:18 + [2025-09-09 20:14:50] iteration 2025/ 11920 | consumed samples: 2073600 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119006E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:37.242327 | finish at 2025-09-10 11:42:27 + [2025-09-09 20:14:56] iteration 2026/ 11920 | consumed samples: 2074624 | elapsed time per iteration (ms): 5881.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.110453E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:09:47.230092 | finish at 2025-09-10 12:24:43 + [2025-09-09 20:15:02] iteration 2027/ 11920 | consumed samples: 2075648 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109337E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:07.087952 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:15:07] iteration 2028/ 11920 | consumed samples: 2076672 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119838E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:21.830204 | finish at 2025-09-10 11:42:29 + [2025-09-09 20:15:13] iteration 2029/ 11920 | consumed samples: 
2077696 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.123216E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:37.242758 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:15:19] iteration 2030/ 11920 | consumed samples: 2078720 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.121986E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:19.861269 | finish at 2025-09-10 11:42:38 + [2025-09-09 20:15:24] iteration 2031/ 11920 | consumed samples: 2079744 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124408E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:04.032830 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:15:30] iteration 2032/ 11920 | consumed samples: 2080768 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.115527E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:52.270409 | finish at 2025-09-10 11:44:22 + [2025-09-09 20:15:35] iteration 2033/ 11920 | consumed samples: 2081792 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.107618E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:41.795687 | finish at 2025-09-10 11:43:17 + [2025-09-09 20:15:41] iteration 2034/ 11920 | consumed samples: 2082816 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.127687E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:30.285108 | finish at 2025-09-10 11:44:11 + [2025-09-09 20:15:47] iteration 2035/ 11920 | consumed samples: 2083840 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.122548E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:45.572273 | finish at 2025-09-10 11:43:32 + [2025-09-09 20:15:52] iteration 2036/ 11920 | consumed samples: 2084864 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.121295E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:22.191986 | finish at 2025-09-10 11:42:14 + [2025-09-09 20:15:58] iteration 2037/ 11920 | consumed samples: 2085888 | elapsed time per iteration (ms): 5848.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125193E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:20.480578 | finish at 2025-09-10 12:19:19 + [2025-09-09 20:16:04] iteration 2038/ 11920 | consumed samples: 2086912 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 
| MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.107320E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:58.743905 | finish at 2025-09-10 11:44:03 + [2025-09-09 20:16:09] iteration 2039/ 11920 | consumed samples: 2087936 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118428E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:44.941486 | finish at 2025-09-10 11:41:54 + [2025-09-09 20:16:15] iteration 2040/ 11920 | consumed samples: 2088960 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.127787E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:26.331453 | finish at 2025-09-10 11:41:41 + [2025-09-09 20:16:21] iteration 2041/ 11920 | consumed samples: 2089984 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131105E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:07.811198 | finish at 2025-09-10 11:41:28 + [2025-09-09 20:16:26] iteration 2042/ 11920 | consumed samples: 2091008 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108228E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:35.647457 | finish at 2025-09-10 
11:43:02 + [2025-09-09 20:16:32] iteration 2043/ 11920 | consumed samples: 2092032 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117012E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:29.692417 | finish at 2025-09-10 11:42:02 + [2025-09-09 20:16:38] iteration 2044/ 11920 | consumed samples: 2093056 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.121221E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:31.151593 | finish at 2025-09-10 11:44:09 + [2025-09-09 20:16:43] iteration 2045/ 11920 | consumed samples: 2094080 | elapsed time per iteration (ms): 5645.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108999E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:04.476050 | finish at 2025-09-10 11:45:48 + [2025-09-09 20:16:49] iteration 2046/ 11920 | consumed samples: 2095104 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.111696E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:41.323073 | finish at 2025-09-10 11:43:30 + [2025-09-09 20:16:54] iteration 2047/ 11920 | consumed samples: 2096128 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.133609E+00 | loss 
scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:33.665276 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:17:00] iteration 2048/ 11920 | consumed samples: 2097152 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.139115E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:34.278679 | finish at 2025-09-10 11:43:34 + [2025-09-09 20:17:06] iteration 2049/ 11920 | consumed samples: 2098176 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.120717E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:09.425046 | finish at 2025-09-10 11:44:15 + [2025-09-09 20:17:11] iteration 2050/ 11920 | consumed samples: 2099200 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137537E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:59.063201 | finish at 2025-09-10 11:42:10 + [2025-09-09 20:17:17] iteration 2051/ 11920 | consumed samples: 2100224 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128767E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:10.376751 | finish at 2025-09-10 11:42:27 + [2025-09-09 20:17:23] iteration 2052/ 11920 | consumed samples: 2101248 | elapsed time 
per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119136E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:20.602267 | finish at 2025-09-10 11:42:43 + [2025-09-09 20:17:28] iteration 2053/ 11920 | consumed samples: 2102272 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109109E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:41.003832 | finish at 2025-09-10 11:42:09 + [2025-09-09 20:17:34] iteration 2054/ 11920 | consumed samples: 2103296 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.123352E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:52.524055 | finish at 2025-09-10 11:42:26 + [2025-09-09 20:17:39] iteration 2055/ 11920 | consumed samples: 2104320 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108653E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:32.186681 | finish at 2025-09-10 11:44:12 + [2025-09-09 20:17:45] iteration 2056/ 11920 | consumed samples: 2105344 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.123299E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:25:22.193098 | finish at 2025-09-10 11:43:07 + [2025-09-09 20:17:51] iteration 2057/ 11920 | consumed samples: 2106368 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.110084E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:48.754318 | finish at 2025-09-10 11:43:39 + [2025-09-09 20:17:56] iteration 2058/ 11920 | consumed samples: 2107392 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117326E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:04.067456 | finish at 2025-09-10 11:43:00 + [2025-09-09 20:18:02] iteration 2059/ 11920 | consumed samples: 2108416 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.115827E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:37.586643 | finish at 2025-09-10 11:43:40 + [2025-09-09 20:18:08] iteration 2060/ 11920 | consumed samples: 2109440 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.115955E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:58.453245 | finish at 2025-09-10 11:43:06 + [2025-09-09 20:18:13] iteration 2061/ 11920 | consumed samples: 2110464 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.107243E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:11.955878 | finish at 2025-09-10 11:43:25 + [2025-09-09 20:18:19] iteration 2062/ 11920 | consumed samples: 2111488 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099945E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:37.684165 | finish at 2025-09-10 11:42:57 + [2025-09-09 20:18:24] iteration 2063/ 11920 | consumed samples: 2112512 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117170E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:44.425018 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:18:30] iteration 2064/ 11920 | consumed samples: 2113536 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.111990E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:47.746674 | finish at 2025-09-10 11:43:18 + [2025-09-09 20:18:36] iteration 2065/ 11920 | consumed samples: 2114560 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.123997E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:59.926683 | finish at 2025-09-10 11:44:36 + [2025-09-09 
20:18:41] iteration 2066/ 11920 | consumed samples: 2115584 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.122895E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:28.541393 | finish at 2025-09-10 11:43:10 + [2025-09-09 20:18:47] iteration 2067/ 11920 | consumed samples: 2116608 | elapsed time per iteration (ms): 5885.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.110715E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:06:26.201630 | finish at 2025-09-10 12:25:13 + [2025-09-09 20:18:53] iteration 2068/ 11920 | consumed samples: 2117632 | elapsed time per iteration (ms): 5987.4 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117996E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:07.426097 | finish at 2025-09-10 12:42:01 + [2025-09-09 20:18:59] iteration 2069/ 11920 | consumed samples: 2118656 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118418E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:09.237506 | finish at 2025-09-10 11:43:08 + [2025-09-09 20:19:05] iteration 2070/ 11920 | consumed samples: 2119680 | elapsed time per iteration (ms): 5997.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.111521E+00 | loss scale: 1.0 | grad norm: 
0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:24:35.104368 | finish at 2025-09-10 12:43:40 + [2025-09-09 20:19:11] iteration 2071/ 11920 | consumed samples: 2120704 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.134766E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:57.143967 | finish at 2025-09-10 11:43:08 + [2025-09-09 20:19:16] iteration 2072/ 11920 | consumed samples: 2121728 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.126100E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:23.044064 | finish at 2025-09-10 11:42:39 + [2025-09-09 20:19:22] iteration 2073/ 11920 | consumed samples: 2122752 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.107619E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:31.098707 | finish at 2025-09-10 11:43:53 + [2025-09-09 20:19:27] iteration 2074/ 11920 | consumed samples: 2123776 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137199E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:18.815027 | finish at 2025-09-10 11:43:46 + [2025-09-09 20:19:33] iteration 2075/ 11920 | consumed samples: 2124800 | elapsed time per iteration (ms): 
5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125846E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:52.686383 | finish at 2025-09-10 11:43:26 + [2025-09-09 20:19:39] iteration 2076/ 11920 | consumed samples: 2125824 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.130936E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:05.804480 | finish at 2025-09-10 11:44:44 + [2025-09-09 20:19:44] iteration 2077/ 11920 | consumed samples: 2126848 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.120792E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:45.623103 | finish at 2025-09-10 11:44:30 + [2025-09-09 20:19:50] iteration 2078/ 11920 | consumed samples: 2127872 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.112542E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:22:41.393137 | finish at 2025-09-10 11:42:31 + [2025-09-09 20:19:56] iteration 2079/ 11920 | consumed samples: 2128896 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.120422E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:22:59.754115 | finish at 2025-09-10 11:42:55 + [2025-09-09 20:20:01] iteration 2080/ 11920 | consumed samples: 2129920 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118624E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:29.509621 | finish at 2025-09-10 11:43:31 + [2025-09-09 20:20:07] iteration 2081/ 11920 | consumed samples: 2130944 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131419E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:24:08.073452 | finish at 2025-09-10 11:44:15 + [2025-09-09 20:20:13] iteration 2082/ 11920 | consumed samples: 2131968 | elapsed time per iteration (ms): 5854.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118980E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:58.218129 | finish at 2025-09-10 12:20:11 + [2025-09-09 20:20:18] iteration 2083/ 11920 | consumed samples: 2132992 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.112473E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:22:12.827144 | finish at 2025-09-10 11:42:31 + [2025-09-09 20:20:24] iteration 2084/ 11920 | consumed samples: 2134016 | elapsed time per iteration (ms): 5838.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.107211E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:57:02.336418 | finish at 2025-09-10 12:17:26 + [2025-09-09 20:20:30] iteration 2085/ 11920 | consumed samples: 2135040 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098629E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:22:56.929656 | finish at 2025-09-10 11:43:27 + [2025-09-09 20:20:35] iteration 2086/ 11920 | consumed samples: 2136064 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117190E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:28.805758 | finish at 2025-09-10 11:44:04 + [2025-09-09 20:20:41] iteration 2087/ 11920 | consumed samples: 2137088 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096728E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:09.456782 | finish at 2025-09-10 11:43:51 + [2025-09-09 20:20:47] iteration 2088/ 11920 | consumed samples: 2138112 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.114954E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:22:25.370632 | finish at 2025-09-10 11:43:12 + [2025-09-09 20:20:52] iteration 
2089/ 11920 | consumed samples: 2139136 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.104537E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:22:12.974707 | finish at 2025-09-10 11:43:05 + [2025-09-09 20:20:58] iteration 2090/ 11920 | consumed samples: 2140160 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119520E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:44.151139 | finish at 2025-09-10 11:42:42 + [2025-09-09 20:21:04] iteration 2091/ 11920 | consumed samples: 2141184 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119760E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:05.855516 | finish at 2025-09-10 11:42:09 + [2025-09-09 20:21:09] iteration 2092/ 11920 | consumed samples: 2142208 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103470E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:27.848296 | finish at 2025-09-10 11:44:37 + [2025-09-09 20:21:15] iteration 2093/ 11920 | consumed samples: 2143232 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118996E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 
7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:15.909709 | finish at 2025-09-10 11:42:31 + [2025-09-09 20:21:20] iteration 2094/ 11920 | consumed samples: 2144256 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109363E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:52.875111 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:21:26] iteration 2095/ 11920 | consumed samples: 2145280 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.110644E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:53.331757 | finish at 2025-09-10 11:42:19 + [2025-09-09 20:21:32] iteration 2096/ 11920 | consumed samples: 2146304 | elapsed time per iteration (ms): 5959.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125303E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:15:42.524048 | finish at 2025-09-10 12:37:15 + [2025-09-09 20:21:38] iteration 2097/ 11920 | consumed samples: 2147328 | elapsed time per iteration (ms): 6328.6 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117408E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:06.057257 | finish at 2025-09-10 13:37:44 + [2025-09-09 20:21:44] iteration 2098/ 11920 | consumed samples: 2148352 | elapsed time per iteration (ms): 5880.7 | throughput per 
GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081713E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:40.295038 | finish at 2025-09-10 12:24:25 + [2025-09-09 20:21:50] iteration 2099/ 11920 | consumed samples: 2149376 | elapsed time per iteration (ms): 5911.2 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.107996E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:33.907365 | finish at 2025-09-10 12:29:24 + [2025-09-09 20:21:56] iteration 2100/ 11920 | consumed samples: 2150400 | elapsed time per iteration (ms): 5834.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.095851E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:55.530572 | finish at 2025-09-10 12:16:52 + [2025-09-09 20:22:02] iteration 2101/ 11920 | consumed samples: 2151424 | elapsed time per iteration (ms): 6072.6 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097421E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:46.783386 | finish at 2025-09-10 12:55:49 + [2025-09-09 20:22:08] iteration 2102/ 11920 | consumed samples: 2152448 | elapsed time per iteration (ms): 5980.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100543E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:18:36.408384 
| finish at 2025-09-10 12:40:44 + [2025-09-09 20:22:14] iteration 2103/ 11920 | consumed samples: 2153472 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.114118E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:04.142810 | finish at 2025-09-10 11:42:18 + [2025-09-09 20:22:19] iteration 2104/ 11920 | consumed samples: 2154496 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098788E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:52.544661 | finish at 2025-09-10 11:42:12 + [2025-09-09 20:22:25] iteration 2105/ 11920 | consumed samples: 2155520 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.105187E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:43.236325 | finish at 2025-09-10 11:42:08 + [2025-09-09 20:22:31] iteration 2106/ 11920 | consumed samples: 2156544 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102049E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:10.490563 | finish at 2025-09-10 11:41:41 + [2025-09-09 20:22:36] iteration 2107/ 11920 | consumed samples: 2157568 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.109119E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:14.407707 | finish at 2025-09-10 11:43:51 + [2025-09-09 20:22:42] iteration 2108/ 11920 | consumed samples: 2158592 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106147E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:26.425428 | finish at 2025-09-10 11:44:08 + [2025-09-09 20:22:47] iteration 2109/ 11920 | consumed samples: 2159616 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103855E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:04.940947 | finish at 2025-09-10 11:43:52 + [2025-09-09 20:22:53] iteration 2110/ 11920 | consumed samples: 2160640 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103776E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:05.714750 | finish at 2025-09-10 11:42:59 + [2025-09-09 20:22:59] iteration 2111/ 11920 | consumed samples: 2161664 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102623E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:01.829549 | finish at 2025-09-10 11:43:01 + [2025-09-09 20:23:04] iteration 2112/ 11920 | consumed samples: 
2162688 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.112878E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:55.331989 | finish at 2025-09-10 11:43:00 + [2025-09-09 20:23:10] iteration 2113/ 11920 | consumed samples: 2163712 | elapsed time per iteration (ms): 5640.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098715E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:59.863372 | finish at 2025-09-10 11:45:10 + [2025-09-09 20:23:16] iteration 2114/ 11920 | consumed samples: 2164736 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103282E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:34.874721 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:23:21] iteration 2115/ 11920 | consumed samples: 2165760 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.088862E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:10.653316 | finish at 2025-09-10 11:43:32 + [2025-09-09 20:23:27] iteration 2116/ 11920 | consumed samples: 2166784 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106746E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:34.855093 | finish at 2025-09-10 11:42:02 + [2025-09-09 20:23:32] iteration 2117/ 11920 | consumed samples: 2167808 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108728E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:17.346085 | finish at 2025-09-10 11:43:50 + [2025-09-09 20:23:38] iteration 2118/ 11920 | consumed samples: 2168832 | elapsed time per iteration (ms): 5948.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118559E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:48.446480 | finish at 2025-09-10 12:35:27 + [2025-09-09 20:23:44] iteration 2119/ 11920 | consumed samples: 2169856 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108283E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:03.946758 | finish at 2025-09-10 11:42:48 + [2025-09-09 20:23:50] iteration 2120/ 11920 | consumed samples: 2170880 | elapsed time per iteration (ms): 6343.1 | throughput per GPU (TFLOP/s/GPU): 71.2 | MFU 7.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.114839E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:16:02.794256 | finish at 2025-09-10 13:39:53 + [2025-09-09 20:23:57] iteration 2121/ 11920 | consumed samples: 2171904 | elapsed time per iteration (ms): 6265.8 | throughput per GPU (TFLOP/s/GPU): 72.1 
| MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106500E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 17:03:18.627927 | finish at 2025-09-10 13:27:15 + [2025-09-09 20:24:02] iteration 2122/ 11920 | consumed samples: 2172928 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099497E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:27.621078 | finish at 2025-09-10 11:43:30 + [2025-09-09 20:24:08] iteration 2123/ 11920 | consumed samples: 2173952 | elapsed time per iteration (ms): 5876.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.116028E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:32.847252 | finish at 2025-09-10 12:23:41 + [2025-09-09 20:24:14] iteration 2124/ 11920 | consumed samples: 2174976 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.094568E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:43.945482 | finish at 2025-09-10 11:44:58 + [2025-09-09 20:24:20] iteration 2125/ 11920 | consumed samples: 2176000 | elapsed time per iteration (ms): 5833.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.107820E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:22.893788 | finish at 2025-09-10 
12:16:43 + [2025-09-09 20:24:25] iteration 2126/ 11920 | consumed samples: 2177024 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108961E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:54.030510 | finish at 2025-09-10 11:44:19 + [2025-09-09 20:24:31] iteration 2127/ 11920 | consumed samples: 2178048 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119944E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:35.864508 | finish at 2025-09-10 11:45:07 + [2025-09-09 20:24:37] iteration 2128/ 11920 | consumed samples: 2179072 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097149E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:23.113907 | finish at 2025-09-10 11:44:00 + [2025-09-09 20:24:42] iteration 2129/ 11920 | consumed samples: 2180096 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117715E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:53.817049 | finish at 2025-09-10 11:43:36 + [2025-09-09 20:24:48] iteration 2130/ 11920 | consumed samples: 2181120 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099613E+00 | loss 
scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:44.981234 | finish at 2025-09-10 11:43:33 + [2025-09-09 20:24:53] iteration 2131/ 11920 | consumed samples: 2182144 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106256E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:50.763788 | finish at 2025-09-10 11:42:44 + [2025-09-09 20:24:59] iteration 2132/ 11920 | consumed samples: 2183168 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099474E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:58.955741 | finish at 2025-09-10 11:43:58 + [2025-09-09 20:25:05] iteration 2133/ 11920 | consumed samples: 2184192 | elapsed time per iteration (ms): 5955.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096532E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:11:29.888833 | finish at 2025-09-10 12:36:35 + [2025-09-09 20:25:11] iteration 2134/ 11920 | consumed samples: 2185216 | elapsed time per iteration (ms): 5965.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.115013E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:02.047206 | finish at 2025-09-10 12:38:13 + [2025-09-09 20:25:17] iteration 2135/ 11920 | consumed samples: 2186240 | elapsed time 
per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102108E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:56.216116 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:25:22] iteration 2136/ 11920 | consumed samples: 2187264 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.101489E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:33.061050 | finish at 2025-09-10 11:43:55 + [2025-09-09 20:25:28] iteration 2137/ 11920 | consumed samples: 2188288 | elapsed time per iteration (ms): 6005.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.101450E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:19:14.512906 | finish at 2025-09-10 12:44:43 + [2025-09-09 20:25:34] iteration 2138/ 11920 | consumed samples: 2189312 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109749E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:39.206597 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:25:40] iteration 2139/ 11920 | consumed samples: 2190336 | elapsed time per iteration (ms): 5648.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.101073E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:20:46.051955 | finish at 2025-09-10 11:46:26 + [2025-09-09 20:25:45] iteration 2140/ 11920 | consumed samples: 2191360 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.111475E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:58.378472 | finish at 2025-09-10 11:43:44 + [2025-09-09 20:25:51] iteration 2141/ 11920 | consumed samples: 2192384 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102154E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:19.900631 | finish at 2025-09-10 11:43:11 + [2025-09-09 20:25:56] iteration 2142/ 11920 | consumed samples: 2193408 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099632E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:57.048486 | finish at 2025-09-10 11:43:53 + [2025-09-09 20:26:02] iteration 2143/ 11920 | consumed samples: 2194432 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108096E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:06.317520 | finish at 2025-09-10 11:43:08 + [2025-09-09 20:26:08] iteration 2144/ 11920 | consumed samples: 2195456 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.109517E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:52.162327 | finish at 2025-09-10 11:44:00 + [2025-09-09 20:26:13] iteration 2145/ 11920 | consumed samples: 2196480 | elapsed time per iteration (ms): 5645.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.104845E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:45.267687 | finish at 2025-09-10 11:45:59 + [2025-09-09 20:26:19] iteration 2146/ 11920 | consumed samples: 2197504 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098972E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:35.500866 | finish at 2025-09-10 11:43:54 + [2025-09-09 20:26:25] iteration 2147/ 11920 | consumed samples: 2198528 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.105964E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:39.829040 | finish at 2025-09-10 11:44:04 + [2025-09-09 20:26:30] iteration 2148/ 11920 | consumed samples: 2199552 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096129E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:40.942327 | finish at 2025-09-10 11:43:11 + [2025-09-09 
20:26:36] iteration 2149/ 11920 | consumed samples: 2200576 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096386E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:44.344255 | finish at 2025-09-10 11:44:20 + [2025-09-09 20:26:42] iteration 2150/ 11920 | consumed samples: 2201600 | elapsed time per iteration (ms): 5642.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.105900E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:47.766993 | finish at 2025-09-10 11:45:29 + [2025-09-09 20:26:47] iteration 2151/ 11920 | consumed samples: 2202624 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108307E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:39.832130 | finish at 2025-09-10 11:43:27 + [2025-09-09 20:26:53] iteration 2152/ 11920 | consumed samples: 2203648 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089851E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:59.703249 | finish at 2025-09-10 11:43:52 + [2025-09-09 20:26:58] iteration 2153/ 11920 | consumed samples: 2204672 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102572E+00 | loss scale: 1.0 | grad norm: 
0.276 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:35.336739 | finish at 2025-09-10 11:43:34 + [2025-09-09 20:27:04] iteration 2154/ 11920 | consumed samples: 2205696 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109943E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:04.412645 | finish at 2025-09-10 11:43:08 + [2025-09-09 20:27:10] iteration 2155/ 11920 | consumed samples: 2206720 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117433E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:20.722733 | finish at 2025-09-10 11:43:30 + [2025-09-09 20:27:15] iteration 2156/ 11920 | consumed samples: 2207744 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099638E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:01.266835 | finish at 2025-09-10 11:43:17 + [2025-09-09 20:27:21] iteration 2157/ 11920 | consumed samples: 2208768 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.095145E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:28.860227 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:27:27] iteration 2158/ 11920 | consumed samples: 2209792 | elapsed time per iteration (ms): 
5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096626E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:50.756003 | finish at 2025-09-10 11:43:17 + [2025-09-09 20:27:32] iteration 2159/ 11920 | consumed samples: 2210816 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.105939E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:42.027121 | finish at 2025-09-10 11:43:14 + [2025-09-09 20:27:38] iteration 2160/ 11920 | consumed samples: 2211840 | elapsed time per iteration (ms): 5642.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.094663E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:53.296089 | finish at 2025-09-10 11:45:31 + [2025-09-09 20:27:44] iteration 2161/ 11920 | consumed samples: 2212864 | elapsed time per iteration (ms): 5874.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106710E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:55:24.082847 | finish at 2025-09-10 12:23:08 + [2025-09-09 20:27:49] iteration 2162/ 11920 | consumed samples: 2213888 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103741E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:15:33.949018 | finish at 2025-09-10 11:43:23 + [2025-09-09 20:27:55] iteration 2163/ 11920 | consumed samples: 2214912 | elapsed time per iteration (ms): 5981.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103688E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:12:36.372223 | finish at 2025-09-10 12:40:32 + [2025-09-09 20:28:01] iteration 2164/ 11920 | consumed samples: 2215936 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100122E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:53.886752 | finish at 2025-09-10 11:42:55 + [2025-09-09 20:28:07] iteration 2165/ 11920 | consumed samples: 2216960 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089378E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:11.948371 | finish at 2025-09-10 11:44:19 + [2025-09-09 20:28:12] iteration 2166/ 11920 | consumed samples: 2217984 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098330E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:55.077330 | finish at 2025-09-10 11:43:07 + [2025-09-09 20:28:18] iteration 2167/ 11920 | consumed samples: 2219008 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.092765E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:23.538219 | finish at 2025-09-10 11:43:41 + [2025-09-09 20:28:23] iteration 2168/ 11920 | consumed samples: 2220032 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096668E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:45.772142 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:28:29] iteration 2169/ 11920 | consumed samples: 2221056 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.105747E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:04.229118 | finish at 2025-09-10 11:43:33 + [2025-09-09 20:28:35] iteration 2170/ 11920 | consumed samples: 2222080 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099305E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:43.546829 | finish at 2025-09-10 11:43:18 + [2025-09-09 20:28:40] iteration 2171/ 11920 | consumed samples: 2223104 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106075E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:09.194686 | finish at 2025-09-10 11:44:50 + [2025-09-09 20:28:46] iteration 
2172/ 11920 | consumed samples: 2224128 | elapsed time per iteration (ms): 5875.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109233E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:35.068554 | finish at 2025-09-10 12:23:21 + [2025-09-09 20:28:52] iteration 2173/ 11920 | consumed samples: 2225152 | elapsed time per iteration (ms): 5954.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119650E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:21.583260 | finish at 2025-09-10 12:36:14 + [2025-09-09 20:28:58] iteration 2174/ 11920 | consumed samples: 2226176 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081690E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:07.367573 | finish at 2025-09-10 11:43:05 + [2025-09-09 20:29:03] iteration 2175/ 11920 | consumed samples: 2227200 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079600E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:05.747739 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:29:09] iteration 2176/ 11920 | consumed samples: 2228224 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117014E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:19.051003 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:29:15] iteration 2177/ 11920 | consumed samples: 2229248 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098673E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:16.045859 | finish at 2025-09-10 11:43:31 + [2025-09-09 20:29:20] iteration 2178/ 11920 | consumed samples: 2230272 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090270E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:32.186668 | finish at 2025-09-10 11:42:53 + [2025-09-09 20:29:26] iteration 2179/ 11920 | consumed samples: 2231296 | elapsed time per iteration (ms): 5840.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090338E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:48:15.479435 | finish at 2025-09-10 12:17:42 + [2025-09-09 20:29:32] iteration 2180/ 11920 | consumed samples: 2232320 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087679E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:16.235180 | finish at 2025-09-10 11:41:48 + [2025-09-09 20:29:37] iteration 2181/ 11920 | consumed samples: 2233344 | elapsed time per iteration (ms): 5630.6 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.091630E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:55.983600 | finish at 2025-09-10 11:43:33 + [2025-09-09 20:29:43] iteration 2182/ 11920 | consumed samples: 2234368 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.091197E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:39.821722 | finish at 2025-09-10 11:43:23 + [2025-09-09 20:29:49] iteration 2183/ 11920 | consumed samples: 2235392 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080811E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:09.713239 | finish at 2025-09-10 11:43:58 + [2025-09-09 20:29:54] iteration 2184/ 11920 | consumed samples: 2236416 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097589E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:50.060310 | finish at 2025-09-10 11:42:44 + [2025-09-09 20:30:00] iteration 2185/ 11920 | consumed samples: 2237440 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.085489E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:38.848429 
| finish at 2025-09-10 11:43:39 + [2025-09-09 20:30:06] iteration 2186/ 11920 | consumed samples: 2238464 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090334E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:57.841877 | finish at 2025-09-10 11:43:03 + [2025-09-09 20:30:11] iteration 2187/ 11920 | consumed samples: 2239488 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109941E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:59.173238 | finish at 2025-09-10 11:44:10 + [2025-09-09 20:30:17] iteration 2188/ 11920 | consumed samples: 2240512 | elapsed time per iteration (ms): 5917.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087518E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:52.172968 | finish at 2025-09-10 12:30:09 + [2025-09-09 20:30:23] iteration 2189/ 11920 | consumed samples: 2241536 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093407E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:39.142851 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:30:28] iteration 2190/ 11920 | consumed samples: 2242560 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.099608E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:55.602500 | finish at 2025-09-10 11:43:24 + [2025-09-09 20:30:34] iteration 2191/ 11920 | consumed samples: 2243584 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097811E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:59.993933 | finish at 2025-09-10 11:44:34 + [2025-09-09 20:30:40] iteration 2192/ 11920 | consumed samples: 2244608 | elapsed time per iteration (ms): 5641.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098677E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:39.704834 | finish at 2025-09-10 11:45:19 + [2025-09-09 20:30:45] iteration 2193/ 11920 | consumed samples: 2245632 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109297E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:42.844138 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:30:51] iteration 2194/ 11920 | consumed samples: 2246656 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.101375E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:01.524595 | finish at 2025-09-10 11:42:52 + [2025-09-09 20:30:57] iteration 2195/ 11920 | consumed samples: 
2247680 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.104570E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:29.984111 | finish at 2025-09-10 11:41:27 + [2025-09-09 20:31:02] iteration 2196/ 11920 | consumed samples: 2248704 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108904E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:46.657610 | finish at 2025-09-10 11:42:49 + [2025-09-09 20:31:08] iteration 2197/ 11920 | consumed samples: 2249728 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099765E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:53.929821 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:31:13] iteration 2198/ 11920 | consumed samples: 2250752 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098947E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:12.179296 | finish at 2025-09-10 11:43:26 + [2025-09-09 20:31:19] iteration 2199/ 11920 | consumed samples: 2251776 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079522E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:17.655596 | finish at 2025-09-10 11:44:37 + [2025-09-09 20:31:25] iteration 2200/ 11920 | consumed samples: 2252800 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090632E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:31.413116 | finish at 2025-09-10 11:44:56 + [2025-09-09 20:31:30] iteration 2201/ 11920 | consumed samples: 2253824 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084774E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:17.635783 | finish at 2025-09-10 11:42:48 + [2025-09-09 20:31:36] iteration 2202/ 11920 | consumed samples: 2254848 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093430E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:12:05.872110 | finish at 2025-09-10 11:43:42 + [2025-09-09 20:31:42] iteration 2203/ 11920 | consumed samples: 2255872 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.094083E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:20.470817 | finish at 2025-09-10 11:45:02 + [2025-09-09 20:31:47] iteration 2204/ 11920 | consumed samples: 2256896 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090143E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:55.927400 | finish at 2025-09-10 11:43:43 + [2025-09-09 20:31:53] iteration 2205/ 11920 | consumed samples: 2257920 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103678E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:25.963807 | finish at 2025-09-10 11:43:19 + [2025-09-09 20:31:58] iteration 2206/ 11920 | consumed samples: 2258944 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100677E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:31.214780 | finish at 2025-09-10 11:42:30 + [2025-09-09 20:32:04] iteration 2207/ 11920 | consumed samples: 2259968 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082301E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:52.191945 | finish at 2025-09-10 11:42:56 + [2025-09-09 20:32:10] iteration 2208/ 11920 | consumed samples: 2260992 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096346E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:52.426037 | finish at 2025-09-10 
11:42:02 + [2025-09-09 20:32:16] iteration 2209/ 11920 | consumed samples: 2262016 | elapsed time per iteration (ms): 5948.1 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093496E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:02:41.692042 | finish at 2025-09-10 12:34:57 + [2025-09-09 20:32:21] iteration 2210/ 11920 | consumed samples: 2263040 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069593E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:18.543961 | finish at 2025-09-10 11:42:40 + [2025-09-09 20:32:27] iteration 2211/ 11920 | consumed samples: 2264064 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.092704E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:09.504415 | finish at 2025-09-10 11:43:36 + [2025-09-09 20:32:33] iteration 2212/ 11920 | consumed samples: 2265088 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084092E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:57.984813 | finish at 2025-09-10 11:42:31 + [2025-09-09 20:32:38] iteration 2213/ 11920 | consumed samples: 2266112 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087811E+00 | loss 
scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:19.759119 | finish at 2025-09-10 11:43:58 + [2025-09-09 20:32:44] iteration 2214/ 11920 | consumed samples: 2267136 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089266E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:26.578473 | finish at 2025-09-10 11:43:10 + [2025-09-09 20:32:49] iteration 2215/ 11920 | consumed samples: 2268160 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093749E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:36.430020 | finish at 2025-09-10 11:43:26 + [2025-09-09 20:32:55] iteration 2216/ 11920 | consumed samples: 2269184 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.088060E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:19.611664 | finish at 2025-09-10 11:43:15 + [2025-09-09 20:33:01] iteration 2217/ 11920 | consumed samples: 2270208 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096173E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:09.819315 | finish at 2025-09-10 11:42:11 + [2025-09-09 20:33:06] iteration 2218/ 11920 | consumed samples: 2271232 | elapsed time 
per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077034E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:05.669440 | finish at 2025-09-10 11:44:12 + [2025-09-09 20:33:12] iteration 2219/ 11920 | consumed samples: 2272256 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084767E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:42.172330 | finish at 2025-09-10 11:41:54 + [2025-09-09 20:33:18] iteration 2220/ 11920 | consumed samples: 2273280 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084446E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:06.940422 | finish at 2025-09-10 11:42:25 + [2025-09-09 20:33:23] iteration 2221/ 11920 | consumed samples: 2274304 | elapsed time per iteration (ms): 5828.6 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082226E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:11.137485 | finish at 2025-09-10 12:15:35 + [2025-09-09 20:33:29] iteration 2222/ 11920 | consumed samples: 2275328 | elapsed time per iteration (ms): 5962.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.095695E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 16:03:46.036941 | finish at 2025-09-10 12:37:15 + [2025-09-09 20:33:35] iteration 2223/ 11920 | consumed samples: 2276352 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084231E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:02.119687 | finish at 2025-09-10 11:43:37 + [2025-09-09 20:33:41] iteration 2224/ 11920 | consumed samples: 2277376 | elapsed time per iteration (ms): 5968.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100227E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:34.114243 | finish at 2025-09-10 12:38:15 + [2025-09-09 20:33:47] iteration 2225/ 11920 | consumed samples: 2278400 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076164E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:51.401230 | finish at 2025-09-10 11:43:38 + [2025-09-09 20:33:52] iteration 2226/ 11920 | consumed samples: 2279424 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102770E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:42.629864 | finish at 2025-09-10 11:42:35 + [2025-09-09 20:33:58] iteration 2227/ 11920 | consumed samples: 2280448 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.100078E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:41.507306 | finish at 2025-09-10 11:42:39 + [2025-09-09 20:34:04] iteration 2228/ 11920 | consumed samples: 2281472 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096315E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:36.224077 | finish at 2025-09-10 12:32:40 + [2025-09-09 20:34:09] iteration 2229/ 11920 | consumed samples: 2282496 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082194E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:27.623653 | finish at 2025-09-10 11:42:37 + [2025-09-09 20:34:15] iteration 2230/ 11920 | consumed samples: 2283520 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087678E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:05.131946 | finish at 2025-09-10 11:43:20 + [2025-09-09 20:34:21] iteration 2231/ 11920 | consumed samples: 2284544 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.096443E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:12.674767 | finish at 2025-09-10 11:43:33 + [2025-09-09 
20:34:26] iteration 2232/ 11920 | consumed samples: 2285568 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097215E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:58.922485 | finish at 2025-09-10 11:44:25 + [2025-09-09 20:34:32] iteration 2233/ 11920 | consumed samples: 2286592 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093573E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:09.292015 | finish at 2025-09-10 11:44:41 + [2025-09-09 20:34:38] iteration 2234/ 11920 | consumed samples: 2287616 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.112638E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:47.145290 | finish at 2025-09-10 11:44:25 + [2025-09-09 20:34:43] iteration 2235/ 11920 | consumed samples: 2288640 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084972E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:40.923871 | finish at 2025-09-10 11:43:24 + [2025-09-09 20:34:49] iteration 2236/ 11920 | consumed samples: 2289664 | elapsed time per iteration (ms): 5994.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.094614E+00 | loss scale: 1.0 | grad norm: 
0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:07:33.958082 | finish at 2025-09-10 12:42:23 + [2025-09-09 20:34:55] iteration 2237/ 11920 | consumed samples: 2290688 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068405E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:34.436924 | finish at 2025-09-10 11:43:29 + [2025-09-09 20:35:01] iteration 2238/ 11920 | consumed samples: 2291712 | elapsed time per iteration (ms): 5836.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072951E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:41:49.077726 | finish at 2025-09-10 12:16:50 + [2025-09-09 20:35:07] iteration 2239/ 11920 | consumed samples: 2292736 | elapsed time per iteration (ms): 5875.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089129E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:47:56.941356 | finish at 2025-09-10 12:23:03 + [2025-09-09 20:35:12] iteration 2240/ 11920 | consumed samples: 2293760 | elapsed time per iteration (ms): 5949.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084238E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:50.930023 | finish at 2025-09-10 12:35:03 + [2025-09-09 20:35:18] iteration 2241/ 11920 | consumed samples: 2294784 | elapsed time per iteration (ms): 
5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079824E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:35.454044 | finish at 2025-09-10 11:42:54 + [2025-09-09 20:35:24] iteration 2242/ 11920 | consumed samples: 2295808 | elapsed time per iteration (ms): 5828.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077913E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:07.651075 | finish at 2025-09-10 12:15:32 + [2025-09-09 20:35:30] iteration 2243/ 11920 | consumed samples: 2296832 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081115E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:19.905792 | finish at 2025-09-10 11:42:49 + [2025-09-09 20:35:35] iteration 2244/ 11920 | consumed samples: 2297856 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.088401E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:43.492849 | finish at 2025-09-10 11:43:19 + [2025-09-09 20:35:41] iteration 2245/ 11920 | consumed samples: 2298880 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.092566E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:09:45.293144 | finish at 2025-09-10 11:45:26 + [2025-09-09 20:35:46] iteration 2246/ 11920 | consumed samples: 2299904 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087549E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:27.491304 | finish at 2025-09-10 11:44:14 + [2025-09-09 20:35:52] iteration 2247/ 11920 | consumed samples: 2300928 | elapsed time per iteration (ms): 5980.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097697E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:04:05.917000 | finish at 2025-09-10 12:39:58 + [2025-09-09 20:35:58] iteration 2248/ 11920 | consumed samples: 2301952 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087098E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:37.024727 | finish at 2025-09-10 11:44:35 + [2025-09-09 20:36:04] iteration 2249/ 11920 | consumed samples: 2302976 | elapsed time per iteration (ms): 5913.1 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.091266E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:05.541720 | finish at 2025-09-10 12:29:10 + [2025-09-09 20:36:10] iteration 2250/ 11920 | consumed samples: 2304000 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.078722E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:04.693687 | finish at 2025-09-10 11:42:14 + [2025-09-09 20:36:15] iteration 2251/ 11920 | consumed samples: 2305024 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.085088E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:51.146176 | finish at 2025-09-10 11:42:06 + [2025-09-09 20:36:21] iteration 2252/ 11920 | consumed samples: 2306048 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073182E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:41.410521 | finish at 2025-09-10 11:42:02 + [2025-09-09 20:36:27] iteration 2253/ 11920 | consumed samples: 2307072 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093057E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:37.521330 | finish at 2025-09-10 11:43:04 + [2025-09-09 20:36:32] iteration 2254/ 11920 | consumed samples: 2308096 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098237E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:37.704912 | finish at 2025-09-10 11:42:10 + [2025-09-09 20:36:38] iteration 
2255/ 11920 | consumed samples: 2309120 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.092484E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:00.990790 | finish at 2025-09-10 11:43:39 + [2025-09-09 20:36:44] iteration 2256/ 11920 | consumed samples: 2310144 | elapsed time per iteration (ms): 5966.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.094594E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:01:03.846176 | finish at 2025-09-10 12:37:48 + [2025-09-09 20:36:49] iteration 2257/ 11920 | consumed samples: 2311168 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097507E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:43.844153 | finish at 2025-09-10 11:42:33 + [2025-09-09 20:36:55] iteration 2258/ 11920 | consumed samples: 2312192 | elapsed time per iteration (ms): 5645.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079222E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:02.501215 | finish at 2025-09-10 11:45:57 + [2025-09-09 20:37:01] iteration 2259/ 11920 | consumed samples: 2313216 | elapsed time per iteration (ms): 6199.1 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076914E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 
6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:38:09.763246 | finish at 2025-09-10 13:15:11 + [2025-09-09 20:37:07] iteration 2260/ 11920 | consumed samples: 2314240 | elapsed time per iteration (ms): 6107.9 | throughput per GPU (TFLOP/s/GPU): 73.9 | MFU 7.47% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082365E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:22.448959 | finish at 2025-09-10 13:00:30 + [2025-09-09 20:37:13] iteration 2261/ 11920 | consumed samples: 2315264 | elapsed time per iteration (ms): 5896.7 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076286E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:16.486915 | finish at 2025-09-10 12:26:30 + [2025-09-09 20:37:19] iteration 2262/ 11920 | consumed samples: 2316288 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086041E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:03.299458 | finish at 2025-09-10 11:41:22 + [2025-09-09 20:37:25] iteration 2263/ 11920 | consumed samples: 2317312 | elapsed time per iteration (ms): 5844.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086725E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:37.212758 | finish at 2025-09-10 12:18:02 + [2025-09-09 20:37:30] iteration 2264/ 11920 | consumed samples: 2318336 | elapsed time per iteration (ms): 5622.4 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076236E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:50.318438 | finish at 2025-09-10 11:42:21 + [2025-09-09 20:37:36] iteration 2265/ 11920 | consumed samples: 2319360 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079431E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:01.495489 | finish at 2025-09-10 11:42:37 + [2025-09-09 20:37:42] iteration 2266/ 11920 | consumed samples: 2320384 | elapsed time per iteration (ms): 5927.0 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073169E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:39.187089 | finish at 2025-09-10 12:31:21 + [2025-09-09 20:37:47] iteration 2267/ 11920 | consumed samples: 2321408 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086131E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:40.441845 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:37:53] iteration 2268/ 11920 | consumed samples: 2322432 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.074278E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:09.280468 
| finish at 2025-09-10 11:43:02 + [2025-09-09 20:37:59] iteration 2269/ 11920 | consumed samples: 2323456 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079196E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:39.753474 | finish at 2025-09-10 11:42:38 + [2025-09-09 20:38:05] iteration 2270/ 11920 | consumed samples: 2324480 | elapsed time per iteration (ms): 5956.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097035E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:58:04.514093 | finish at 2025-09-10 12:36:09 + [2025-09-09 20:38:11] iteration 2271/ 11920 | consumed samples: 2325504 | elapsed time per iteration (ms): 6217.4 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071041E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:39:51.486269 | finish at 2025-09-10 13:18:02 + [2025-09-09 20:38:17] iteration 2272/ 11920 | consumed samples: 2326528 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.083068E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:55.670929 | finish at 2025-09-10 11:43:12 + [2025-09-09 20:38:22] iteration 2273/ 11920 | consumed samples: 2327552 | elapsed time per iteration (ms): 5849.4 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.090125E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:28.681515 | finish at 2025-09-10 12:18:51 + [2025-09-09 20:38:28] iteration 2274/ 11920 | consumed samples: 2328576 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067568E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:36.074799 | finish at 2025-09-10 11:45:04 + [2025-09-09 20:38:34] iteration 2275/ 11920 | consumed samples: 2329600 | elapsed time per iteration (ms): 5842.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073207E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:09.355431 | finish at 2025-09-10 12:17:43 + [2025-09-09 20:38:39] iteration 2276/ 11920 | consumed samples: 2330624 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084530E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:25.897695 | finish at 2025-09-10 11:45:05 + [2025-09-09 20:38:45] iteration 2277/ 11920 | consumed samples: 2331648 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066780E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:23.518448 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:38:51] iteration 2278/ 11920 | consumed samples: 
2332672 | elapsed time per iteration (ms): 5642.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075702E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:47.483095 | finish at 2025-09-10 11:45:38 + [2025-09-09 20:38:56] iteration 2279/ 11920 | consumed samples: 2333696 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071256E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:51.413604 | finish at 2025-09-10 11:43:48 + [2025-09-09 20:39:02] iteration 2280/ 11920 | consumed samples: 2334720 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079160E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:05.158873 | finish at 2025-09-10 11:43:07 + [2025-09-09 20:39:08] iteration 2281/ 11920 | consumed samples: 2335744 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079427E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:20.211005 | finish at 2025-09-10 11:42:28 + [2025-09-09 20:39:13] iteration 2282/ 11920 | consumed samples: 2336768 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081388E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:11.784582 | finish at 2025-09-10 11:42:25 + [2025-09-09 20:39:19] iteration 2283/ 11920 | consumed samples: 2337792 | elapsed time per iteration (ms): 5855.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079490E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:28.150573 | finish at 2025-09-10 12:19:47 + [2025-09-09 20:39:25] iteration 2284/ 11920 | consumed samples: 2338816 | elapsed time per iteration (ms): 5872.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072716E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:02.768698 | finish at 2025-09-10 12:22:28 + [2025-09-09 20:39:31] iteration 2285/ 11920 | consumed samples: 2339840 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067790E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:21.278661 | finish at 2025-09-10 11:42:52 + [2025-09-09 20:39:37] iteration 2286/ 11920 | consumed samples: 2340864 | elapsed time per iteration (ms): 6336.3 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068643E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:57:24.335835 | finish at 2025-09-10 13:37:01 + [2025-09-09 20:39:43] iteration 2287/ 11920 | consumed samples: 2341888 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093001E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:50.192195 | finish at 2025-09-10 11:43:33 + [2025-09-09 20:39:48] iteration 2288/ 11920 | consumed samples: 2342912 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075242E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:38.330009 | finish at 2025-09-10 11:43:27 + [2025-09-09 20:39:54] iteration 2289/ 11920 | consumed samples: 2343936 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066724E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:05.268267 | finish at 2025-09-10 11:41:59 + [2025-09-09 20:39:59] iteration 2290/ 11920 | consumed samples: 2344960 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053125E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:46.121113 | finish at 2025-09-10 11:42:46 + [2025-09-09 20:40:05] iteration 2291/ 11920 | consumed samples: 2345984 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.078373E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:18.839710 | finish at 2025-09-10 
11:43:24 + [2025-09-09 20:40:11] iteration 2292/ 11920 | consumed samples: 2347008 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073497E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:17.914469 | finish at 2025-09-10 11:43:29 + [2025-09-09 20:40:16] iteration 2293/ 11920 | consumed samples: 2348032 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073292E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:52.692691 | finish at 2025-09-10 11:42:09 + [2025-09-09 20:40:22] iteration 2294/ 11920 | consumed samples: 2349056 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.088729E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:13.333643 | finish at 2025-09-10 11:42:35 + [2025-09-09 20:40:28] iteration 2295/ 11920 | consumed samples: 2350080 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072735E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:43.084713 | finish at 2025-09-10 11:42:11 + [2025-09-09 20:40:33] iteration 2296/ 11920 | consumed samples: 2351104 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070364E+00 | loss 
scale: 1.0 | grad norm: 0.261 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:12.079050 | finish at 2025-09-10 11:42:45 + [2025-09-09 20:40:39] iteration 2297/ 11920 | consumed samples: 2352128 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063840E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:32.614570 | finish at 2025-09-10 11:44:11 + [2025-09-09 20:40:44] iteration 2298/ 11920 | consumed samples: 2353152 | elapsed time per iteration (ms): 5643.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073721E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:01.819802 | finish at 2025-09-10 11:45:46 + [2025-09-09 20:40:50] iteration 2299/ 11920 | consumed samples: 2354176 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069950E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:18.656613 | finish at 2025-09-10 11:44:09 + [2025-09-09 20:40:56] iteration 2300/ 11920 | consumed samples: 2355200 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080710E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:43.098817 | finish at 2025-09-10 11:43:39 + [2025-09-09 20:41:01] iteration 2301/ 11920 | consumed samples: 2356224 | elapsed time 
per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081134E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:28.130043 | finish at 2025-09-10 11:43:30 + [2025-09-09 20:41:07] iteration 2302/ 11920 | consumed samples: 2357248 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060512E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:26.418899 | finish at 2025-09-10 11:44:33 + [2025-09-09 20:41:13] iteration 2303/ 11920 | consumed samples: 2358272 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073923E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:45.440783 | finish at 2025-09-10 11:42:58 + [2025-09-09 20:41:18] iteration 2304/ 11920 | consumed samples: 2359296 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076115E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:54.969067 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:41:24] iteration 2305/ 11920 | consumed samples: 2360320 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.058146E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:01:02.047076 | finish at 2025-09-10 11:42:26 + [2025-09-09 20:41:30] iteration 2306/ 11920 | consumed samples: 2361344 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098630E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:52.969600 | finish at 2025-09-10 11:43:22 + [2025-09-09 20:41:35] iteration 2307/ 11920 | consumed samples: 2362368 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084375E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:01.750444 | finish at 2025-09-10 11:44:37 + [2025-09-09 20:41:41] iteration 2308/ 11920 | consumed samples: 2363392 | elapsed time per iteration (ms): 5640.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062501E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:39.406260 | finish at 2025-09-10 11:45:20 + [2025-09-09 20:41:46] iteration 2309/ 11920 | consumed samples: 2364416 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072415E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:12.345975 | finish at 2025-09-10 11:43:59 + [2025-09-09 20:41:52] iteration 2310/ 11920 | consumed samples: 2365440 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.082679E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:22.930775 | finish at 2025-09-10 11:44:15 + [2025-09-09 20:41:58] iteration 2311/ 11920 | consumed samples: 2366464 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090250E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:06.547201 | finish at 2025-09-10 11:43:04 + [2025-09-09 20:42:03] iteration 2312/ 11920 | consumed samples: 2367488 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062889E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:58.824530 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:42:09] iteration 2313/ 11920 | consumed samples: 2368512 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064857E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:44.709545 | finish at 2025-09-10 11:42:54 + [2025-09-09 20:42:15] iteration 2314/ 11920 | consumed samples: 2369536 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073109E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:58.486918 | finish at 2025-09-10 11:44:13 + [2025-09-09 
20:42:20] iteration 2315/ 11920 | consumed samples: 2370560 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082675E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:21.569264 | finish at 2025-09-10 11:43:42 + [2025-09-09 20:42:26] iteration 2316/ 11920 | consumed samples: 2371584 | elapsed time per iteration (ms): 5946.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070760E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:50.825451 | finish at 2025-09-10 12:34:17 + [2025-09-09 20:42:32] iteration 2317/ 11920 | consumed samples: 2372608 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086550E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:37.091582 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:42:37] iteration 2318/ 11920 | consumed samples: 2373632 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077127E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:16.540699 | finish at 2025-09-10 11:43:54 + [2025-09-09 20:42:43] iteration 2319/ 11920 | consumed samples: 2374656 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061257E+00 | loss scale: 1.0 | grad norm: 
0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:17.053484 | finish at 2025-09-10 11:45:00 + [2025-09-09 20:42:49] iteration 2320/ 11920 | consumed samples: 2375680 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068606E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:55.929565 | finish at 2025-09-10 11:43:45 + [2025-09-09 20:42:54] iteration 2321/ 11920 | consumed samples: 2376704 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068600E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:18.750665 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:43:00] iteration 2322/ 11920 | consumed samples: 2377728 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069995E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:23.887484 | finish at 2025-09-10 11:43:24 + [2025-09-09 20:43:06] iteration 2323/ 11920 | consumed samples: 2378752 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071640E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:18.779150 | finish at 2025-09-10 11:44:24 + [2025-09-09 20:43:11] iteration 2324/ 11920 | consumed samples: 2379776 | elapsed time per iteration (ms): 
5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064508E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:31.587409 | finish at 2025-09-10 11:43:43 + [2025-09-09 20:43:17] iteration 2325/ 11920 | consumed samples: 2380800 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079130E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:45.829382 | finish at 2025-09-10 11:44:03 + [2025-09-09 20:43:22] iteration 2326/ 11920 | consumed samples: 2381824 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070055E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:39.311455 | finish at 2025-09-10 11:44:02 + [2025-09-09 20:43:28] iteration 2327/ 11920 | consumed samples: 2382848 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059929E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:41.261949 | finish at 2025-09-10 11:43:09 + [2025-09-09 20:43:34] iteration 2328/ 11920 | consumed samples: 2383872 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070969E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:00:33.539385 | finish at 2025-09-10 11:44:07 + [2025-09-09 20:43:39] iteration 2329/ 11920 | consumed samples: 2384896 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062377E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:45.906883 | finish at 2025-09-10 11:44:25 + [2025-09-09 20:43:45] iteration 2330/ 11920 | consumed samples: 2385920 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075036E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:10.763099 | finish at 2025-09-10 11:43:56 + [2025-09-09 20:43:51] iteration 2331/ 11920 | consumed samples: 2386944 | elapsed time per iteration (ms): 5945.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080501E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:06.835266 | finish at 2025-09-10 12:33:58 + [2025-09-09 20:43:57] iteration 2332/ 11920 | consumed samples: 2387968 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065640E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:36.377143 | finish at 2025-09-10 11:42:33 + [2025-09-09 20:44:02] iteration 2333/ 11920 | consumed samples: 2388992 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.052902E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:05.393895 | finish at 2025-09-10 11:43:08 + [2025-09-09 20:44:08] iteration 2334/ 11920 | consumed samples: 2390016 | elapsed time per iteration (ms): 5823.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077433E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:19.501538 | finish at 2025-09-10 12:14:28 + [2025-09-09 20:44:14] iteration 2335/ 11920 | consumed samples: 2391040 | elapsed time per iteration (ms): 5957.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.074401E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:51:46.549652 | finish at 2025-09-10 12:36:01 + [2025-09-09 20:44:20] iteration 2336/ 11920 | consumed samples: 2392064 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.078922E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:08.241821 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:44:26] iteration 2337/ 11920 | consumed samples: 2393088 | elapsed time per iteration (ms): 5952.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065312E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:42.198473 | finish at 2025-09-10 12:35:08 + [2025-09-09 20:44:31] iteration 
2338/ 11920 | consumed samples: 2394112 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067434E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:56.807932 | finish at 2025-09-10 11:43:28 + [2025-09-09 20:44:37] iteration 2339/ 11920 | consumed samples: 2395136 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076440E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:23.925638 | finish at 2025-09-10 11:45:01 + [2025-09-09 20:44:42] iteration 2340/ 11920 | consumed samples: 2396160 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067484E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:25.940838 | finish at 2025-09-10 11:45:08 + [2025-09-09 20:44:48] iteration 2341/ 11920 | consumed samples: 2397184 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053794E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:51.821959 | finish at 2025-09-10 11:43:40 + [2025-09-09 20:44:54] iteration 2342/ 11920 | consumed samples: 2398208 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064684E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:01.672602 | finish at 2025-09-10 11:41:55 + [2025-09-09 20:44:59] iteration 2343/ 11920 | consumed samples: 2399232 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068453E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:14.482095 | finish at 2025-09-10 11:42:14 + [2025-09-09 20:45:05] iteration 2344/ 11920 | consumed samples: 2400256 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065430E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:16.541204 | finish at 2025-09-10 11:42:22 + [2025-09-09 20:45:11] iteration 2345/ 11920 | consumed samples: 2401280 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072185E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:19.575773 | finish at 2025-09-10 11:42:30 + [2025-09-09 20:45:16] iteration 2346/ 11920 | consumed samples: 2402304 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077624E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:37.141037 | finish at 2025-09-10 11:41:53 + [2025-09-09 20:45:22] iteration 2347/ 11920 | consumed samples: 2403328 | elapsed time per iteration (ms): 5629.2 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071177E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:08.548251 | finish at 2025-09-10 11:43:30 + [2025-09-09 20:45:27] iteration 2348/ 11920 | consumed samples: 2404352 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.056297E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:17.707500 | finish at 2025-09-10 11:42:45 + [2025-09-09 20:45:33] iteration 2349/ 11920 | consumed samples: 2405376 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068103E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:30.659594 | finish at 2025-09-10 11:42:04 + [2025-09-09 20:45:39] iteration 2350/ 11920 | consumed samples: 2406400 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075886E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:14.382727 | finish at 2025-09-10 11:42:53 + [2025-09-09 20:45:44] iteration 2351/ 11920 | consumed samples: 2407424 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068214E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:29.677348 
| finish at 2025-09-10 11:44:14 + [2025-09-09 20:45:50] iteration 2352/ 11920 | consumed samples: 2408448 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100425E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:00.323929 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:45:56] iteration 2353/ 11920 | consumed samples: 2409472 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066223E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:37.573518 | finish at 2025-09-10 11:42:33 + [2025-09-09 20:46:01] iteration 2354/ 11920 | consumed samples: 2410496 | elapsed time per iteration (ms): 5645.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084417E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:08.223363 | finish at 2025-09-10 11:46:09 + [2025-09-09 20:46:07] iteration 2355/ 11920 | consumed samples: 2411520 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068592E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:04.916470 | finish at 2025-09-10 11:44:12 + [2025-09-09 20:46:13] iteration 2356/ 11920 | consumed samples: 2412544 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.055971E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:02.331425 | finish at 2025-09-10 11:45:15 + [2025-09-09 20:46:18] iteration 2357/ 11920 | consumed samples: 2413568 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063591E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:40.370666 | finish at 2025-09-10 11:43:59 + [2025-09-09 20:46:24] iteration 2358/ 11920 | consumed samples: 2414592 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064919E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:05.411691 | finish at 2025-09-10 11:43:29 + [2025-09-09 20:46:29] iteration 2359/ 11920 | consumed samples: 2415616 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061752E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:04.764120 | finish at 2025-09-10 11:45:34 + [2025-09-09 20:46:35] iteration 2360/ 11920 | consumed samples: 2416640 | elapsed time per iteration (ms): 5983.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072837E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:53:23.322182 | finish at 2025-09-10 12:39:59 + [2025-09-09 20:46:41] iteration 2361/ 11920 | consumed samples: 
2417664 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070558E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:24.868306 | finish at 2025-09-10 11:44:06 + [2025-09-09 20:46:47] iteration 2362/ 11920 | consumed samples: 2418688 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076894E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:49.071799 | finish at 2025-09-10 11:44:36 + [2025-09-09 20:46:53] iteration 2363/ 11920 | consumed samples: 2419712 | elapsed time per iteration (ms): 5943.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048658E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:40.792639 | finish at 2025-09-10 12:33:33 + [2025-09-09 20:46:58] iteration 2364/ 11920 | consumed samples: 2420736 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071278E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:07.138023 | finish at 2025-09-10 11:44:05 + [2025-09-09 20:47:04] iteration 2365/ 11920 | consumed samples: 2421760 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070235E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:09.184653 | finish at 2025-09-10 11:44:13 + [2025-09-09 20:47:10] iteration 2366/ 11920 | consumed samples: 2422784 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069764E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:48.205155 | finish at 2025-09-10 11:43:58 + [2025-09-09 20:47:15] iteration 2367/ 11920 | consumed samples: 2423808 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076123E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:43.024694 | finish at 2025-09-10 11:45:58 + [2025-09-09 20:47:21] iteration 2368/ 11920 | consumed samples: 2424832 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086656E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:04.663914 | finish at 2025-09-10 11:43:25 + [2025-09-09 20:47:26] iteration 2369/ 11920 | consumed samples: 2425856 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075796E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:50.519049 | finish at 2025-09-10 11:44:17 + [2025-09-09 20:47:32] iteration 2370/ 11920 | consumed samples: 2426880 | elapsed time per iteration (ms): 5645.7 | throughput per GPU (TFLOP/s/GPU): 80.0 
| MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057361E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:36.072762 | finish at 2025-09-10 11:46:08 + [2025-09-09 20:47:38] iteration 2371/ 11920 | consumed samples: 2427904 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072958E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:23.159585 | finish at 2025-09-10 11:44:01 + [2025-09-09 20:47:44] iteration 2372/ 11920 | consumed samples: 2428928 | elapsed time per iteration (ms): 5986.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061766E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:52:37.640775 | finish at 2025-09-10 12:40:21 + [2025-09-09 20:47:49] iteration 2373/ 11920 | consumed samples: 2429952 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066588E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:57:23.460360 | finish at 2025-09-10 11:45:13 + [2025-09-09 20:47:55] iteration 2374/ 11920 | consumed samples: 2430976 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062879E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:34.383438 | finish at 2025-09-10 
11:43:29 + [2025-09-09 20:48:01] iteration 2375/ 11920 | consumed samples: 2432000 | elapsed time per iteration (ms): 5843.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059714E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:39.911383 | finish at 2025-09-10 12:17:41 + [2025-09-09 20:48:07] iteration 2376/ 11920 | consumed samples: 2433024 | elapsed time per iteration (ms): 6326.0 | throughput per GPU (TFLOP/s/GPU): 71.4 | MFU 7.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059300E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:46:15.227715 | finish at 2025-09-10 13:34:22 + [2025-09-09 20:48:13] iteration 2377/ 11920 | consumed samples: 2434048 | elapsed time per iteration (ms): 5922.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073703E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:02.018186 | finish at 2025-09-10 12:30:15 + [2025-09-09 20:48:19] iteration 2378/ 11920 | consumed samples: 2435072 | elapsed time per iteration (ms): 6227.0 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057895E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:30:18.386605 | finish at 2025-09-10 13:18:38 + [2025-09-09 20:48:25] iteration 2379/ 11920 | consumed samples: 2436096 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061394E+00 | loss 
scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:48.315696 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:48:31] iteration 2380/ 11920 | consumed samples: 2437120 | elapsed time per iteration (ms): 5847.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073092E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:48.348784 | finish at 2025-09-10 12:18:19 + [2025-09-09 20:48:36] iteration 2381/ 11920 | consumed samples: 2438144 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055944E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:04.616496 | finish at 2025-09-10 11:44:41 + [2025-09-09 20:48:42] iteration 2382/ 11920 | consumed samples: 2439168 | elapsed time per iteration (ms): 6041.8 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052993E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:00:26.936481 | finish at 2025-09-10 12:49:09 + [2025-09-09 20:48:49] iteration 2383/ 11920 | consumed samples: 2440192 | elapsed time per iteration (ms): 6182.1 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048459E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:22:38.574559 | finish at 2025-09-10 13:11:27 + [2025-09-09 20:48:54] iteration 2384/ 11920 | consumed samples: 2441216 | elapsed time 
per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070875E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:30.605896 | finish at 2025-09-10 11:44:25 +(min, max) time across ranks (ms): + save-checkpoint ................................: (6322.54, 6322.93) + [2025-09-09 20:49:06] iteration 2385/ 11920 | consumed samples: 2442240 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059540E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:39.248556 | finish at 2025-09-10 11:41:45 + [2025-09-09 20:49:12] iteration 2386/ 11920 | consumed samples: 2443264 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.078814E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:03.158780 | finish at 2025-09-10 11:42:15 + [2025-09-09 20:49:17] iteration 2387/ 11920 | consumed samples: 2444288 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061209E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:59.318199 | finish at 2025-09-10 11:42:17 + [2025-09-09 20:49:23] iteration 2388/ 11920 | consumed samples: 2445312 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068466E+00 
| loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:15.350251 | finish at 2025-09-10 11:43:38 + [2025-09-09 20:49:29] iteration 2389/ 11920 | consumed samples: 2446336 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053027E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:12.326232 | finish at 2025-09-10 11:41:41 + [2025-09-09 20:49:34] iteration 2390/ 11920 | consumed samples: 2447360 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049656E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:23.814278 | finish at 2025-09-10 11:42:58 + [2025-09-09 20:49:40] iteration 2391/ 11920 | consumed samples: 2448384 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062807E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:47.112973 | finish at 2025-09-10 11:43:27 + [2025-09-09 20:49:46] iteration 2392/ 11920 | consumed samples: 2449408 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080791E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:06.509686 | finish at 2025-09-10 11:41:52 + [2025-09-09 20:49:51] iteration 2393/ 11920 | consumed samples: 2450432 | elapsed 
time per iteration (ms): 5642.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061877E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:55.542548 | finish at 2025-09-10 11:45:47 + [2025-09-09 20:49:57] iteration 2394/ 11920 | consumed samples: 2451456 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063865E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:01.796700 | finish at 2025-09-10 11:43:59 + [2025-09-09 20:50:02] iteration 2395/ 11920 | consumed samples: 2452480 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062277E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:23.893322 | finish at 2025-09-10 11:43:26 + [2025-09-09 20:50:08] iteration 2396/ 11920 | consumed samples: 2453504 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067839E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:30.197199 | finish at 2025-09-10 11:42:38 + [2025-09-09 20:50:14] iteration 2397/ 11920 | consumed samples: 2454528 | elapsed time per iteration (ms): 5955.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060188E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | remaining time: 15:45:13.796311 | finish at 2025-09-10 12:35:28 + [2025-09-09 20:50:20] iteration 2398/ 11920 | consumed samples: 2455552 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060825E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:20.646807 | finish at 2025-09-10 11:45:40 + [2025-09-09 20:50:25] iteration 2399/ 11920 | consumed samples: 2456576 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046697E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:25.709713 | finish at 2025-09-10 11:42:51 + [2025-09-09 20:50:31] iteration 2400/ 11920 | consumed samples: 2457600 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053291E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:00.799732 | finish at 2025-09-10 11:42:32 + [2025-09-09 20:50:37] iteration 2401/ 11920 | consumed samples: 2458624 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049603E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:13.556263 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:50:42] iteration 2402/ 11920 | consumed samples: 2459648 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049472E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:59.568430 | finish at 2025-09-10 11:41:42 + [2025-09-09 20:50:48] iteration 2403/ 11920 | consumed samples: 2460672 | elapsed time per iteration (ms): 6049.9 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059229E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:36.796716 | finish at 2025-09-10 12:50:25 + [2025-09-09 20:50:54] iteration 2404/ 11920 | consumed samples: 2461696 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047988E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:47.861795 | finish at 2025-09-10 11:43:42 + [2025-09-09 20:50:59] iteration 2405/ 11920 | consumed samples: 2462720 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047522E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:42.543346 | finish at 2025-09-10 11:43:42 + [2025-09-09 20:51:05] iteration 2406/ 11920 | consumed samples: 2463744 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038932E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:17.601643 | finish at 2025-09-10 11:43:23 + [2025-09-09 
20:51:11] iteration 2407/ 11920 | consumed samples: 2464768 | elapsed time per iteration (ms): 5973.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044152E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:47:09.670633 | finish at 2025-09-10 12:38:21 + [2025-09-09 20:51:17] iteration 2408/ 11920 | consumed samples: 2465792 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.054455E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:38.194042 | finish at 2025-09-10 11:41:55 + [2025-09-09 20:51:23] iteration 2409/ 11920 | consumed samples: 2466816 | elapsed time per iteration (ms): 5943.9 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064367E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:42:12.363635 | finish at 2025-09-10 12:33:35 + [2025-09-09 20:51:28] iteration 2410/ 11920 | consumed samples: 2467840 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041257E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:33.789647 | finish at 2025-09-10 11:42:02 + [2025-09-09 20:51:34] iteration 2411/ 11920 | consumed samples: 2468864 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053396E+00 | loss scale: 1.0 | grad norm: 
0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:43.260917 | finish at 2025-09-10 11:42:17 + [2025-09-09 20:51:39] iteration 2412/ 11920 | consumed samples: 2469888 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066406E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:35.149330 | finish at 2025-09-10 11:42:15 + [2025-09-09 20:51:45] iteration 2413/ 11920 | consumed samples: 2470912 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047190E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:46.377286 | finish at 2025-09-10 11:42:31 + [2025-09-09 20:51:51] iteration 2414/ 11920 | consumed samples: 2471936 | elapsed time per iteration (ms): 5937.7 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062815E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:43.774249 | finish at 2025-09-10 12:32:35 + [2025-09-09 20:51:57] iteration 2415/ 11920 | consumed samples: 2472960 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063717E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:38.065629 | finish at 2025-09-10 11:45:35 + [2025-09-09 20:52:02] iteration 2416/ 11920 | consumed samples: 2473984 | elapsed time per iteration (ms): 
5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.058049E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:11.488266 | finish at 2025-09-10 11:43:14 + [2025-09-09 20:52:08] iteration 2417/ 11920 | consumed samples: 2475008 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071131E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:47.938219 | finish at 2025-09-10 11:43:56 + [2025-09-09 20:52:14] iteration 2418/ 11920 | consumed samples: 2476032 | elapsed time per iteration (ms): 5969.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067975E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:45:21.695747 | finish at 2025-09-10 12:37:36 + [2025-09-09 20:52:20] iteration 2419/ 11920 | consumed samples: 2477056 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065702E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:19.759308 | finish at 2025-09-10 11:42:39 + [2025-09-09 20:52:25] iteration 2420/ 11920 | consumed samples: 2478080 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068193E+00 | loss scale: 1.0 | grad norm: 0.293 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 14:50:36.107039 | finish at 2025-09-10 11:43:01 + [2025-09-09 20:52:31] iteration 2421/ 11920 | consumed samples: 2479104 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.074963E+00 | loss scale: 1.0 | grad norm: 0.310 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:57.747368 | finish at 2025-09-10 11:43:29 + [2025-09-09 20:52:36] iteration 2422/ 11920 | consumed samples: 2480128 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061589E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:56.580709 | finish at 2025-09-10 11:43:33 + [2025-09-09 20:52:42] iteration 2423/ 11920 | consumed samples: 2481152 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068941E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:20.206112 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:52:48] iteration 2424/ 11920 | consumed samples: 2482176 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046870E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:16.670849 | finish at 2025-09-10 11:43:04 + [2025-09-09 20:52:53] iteration 2425/ 11920 | consumed samples: 2483200 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.059418E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:29.503255 | finish at 2025-09-10 11:45:23 + [2025-09-09 20:52:59] iteration 2426/ 11920 | consumed samples: 2484224 | elapsed time per iteration (ms): 5984.7 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067055E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:58.399668 | finish at 2025-09-10 12:39:58 + [2025-09-09 20:53:05] iteration 2427/ 11920 | consumed samples: 2485248 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073984E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:51.915051 | finish at 2025-09-10 11:44:57 + [2025-09-09 20:53:11] iteration 2428/ 11920 | consumed samples: 2486272 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082614E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:21.296803 | finish at 2025-09-10 11:42:32 + [2025-09-09 20:53:16] iteration 2429/ 11920 | consumed samples: 2487296 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070985E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:01.343540 | finish at 2025-09-10 11:43:18 + [2025-09-09 20:53:22] iteration 
2430/ 11920 | consumed samples: 2488320 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.058270E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:18.098578 | finish at 2025-09-10 11:43:40 + [2025-09-09 20:53:27] iteration 2431/ 11920 | consumed samples: 2489344 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052499E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:31.437271 | finish at 2025-09-10 11:43:59 + [2025-09-09 20:53:33] iteration 2432/ 11920 | consumed samples: 2490368 | elapsed time per iteration (ms): 5984.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070613E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:46:19.003502 | finish at 2025-09-10 12:39:52 + [2025-09-09 20:53:39] iteration 2433/ 11920 | consumed samples: 2491392 | elapsed time per iteration (ms): 5932.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.070369E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:38:03.594247 | finish at 2025-09-10 12:31:43 + [2025-09-09 20:53:45] iteration 2434/ 11920 | consumed samples: 2492416 | elapsed time per iteration (ms): 5857.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064022E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 
6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:01.893620 | finish at 2025-09-10 12:19:47 + [2025-09-09 20:53:51] iteration 2435/ 11920 | consumed samples: 2493440 | elapsed time per iteration (ms): 5863.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059944E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:26:53.964396 | finish at 2025-09-10 12:20:45 + [2025-09-09 20:53:57] iteration 2436/ 11920 | consumed samples: 2494464 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050106E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:26.633880 | finish at 2025-09-10 11:44:23 + [2025-09-09 20:54:03] iteration 2437/ 11920 | consumed samples: 2495488 | elapsed time per iteration (ms): 5974.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053040E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:15.988500 | finish at 2025-09-10 12:38:19 + [2025-09-09 20:54:09] iteration 2438/ 11920 | consumed samples: 2496512 | elapsed time per iteration (ms): 5953.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.078254E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:49.507089 | finish at 2025-09-10 12:34:58 + [2025-09-09 20:54:14] iteration 2439/ 11920 | consumed samples: 2497536 | elapsed time per iteration (ms): 5629.3 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.071232E+00 | loss scale: 1.0 | grad norm: 0.319 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:31.760606 | finish at 2025-09-10 11:43:46 + [2025-09-09 20:54:20] iteration 2440/ 11920 | consumed samples: 2498560 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.056532E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:19.361944 | finish at 2025-09-10 11:43:39 + [2025-09-09 20:54:26] iteration 2441/ 11920 | consumed samples: 2499584 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.056865E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:26.678490 | finish at 2025-09-10 11:42:52 + [2025-09-09 20:54:31] iteration 2442/ 11920 | consumed samples: 2500608 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064028E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:09.306872 | finish at 2025-09-10 11:43:40 + [2025-09-09 20:54:37] iteration 2443/ 11920 | consumed samples: 2501632 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063090E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:13.684581 
| finish at 2025-09-10 11:42:50 + [2025-09-09 20:54:42] iteration 2444/ 11920 | consumed samples: 2502656 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060908E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:59.340383 | finish at 2025-09-10 11:42:42 + [2025-09-09 20:54:48] iteration 2445/ 11920 | consumed samples: 2503680 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069795E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:25.135688 | finish at 2025-09-10 11:44:13 + [2025-09-09 20:54:54] iteration 2446/ 11920 | consumed samples: 2504704 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072943E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:04.048927 | finish at 2025-09-10 11:43:58 + [2025-09-09 20:55:00] iteration 2447/ 11920 | consumed samples: 2505728 | elapsed time per iteration (ms): 5959.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066534E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:40:57.481316 | finish at 2025-09-10 12:35:57 + [2025-09-09 20:55:05] iteration 2448/ 11920 | consumed samples: 2506752 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.060079E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:11.270386 | finish at 2025-09-10 11:42:17 + [2025-09-09 20:55:11] iteration 2449/ 11920 | consumed samples: 2507776 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055826E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:33.971148 | finish at 2025-09-10 11:42:45 + [2025-09-09 20:55:16] iteration 2450/ 11920 | consumed samples: 2508800 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049827E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:33.320031 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:55:22] iteration 2451/ 11920 | consumed samples: 2509824 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059097E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:36.216788 | finish at 2025-09-10 11:42:58 + [2025-09-09 20:55:28] iteration 2452/ 11920 | consumed samples: 2510848 | elapsed time per iteration (ms): 5836.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067217E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:01.028741 | finish at 2025-09-10 12:16:29 + [2025-09-09 20:55:34] iteration 2453/ 11920 | consumed samples: 
2511872 | elapsed time per iteration (ms): 5995.4 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062563E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:45:58.516801 | finish at 2025-09-10 12:41:32 + [2025-09-09 20:55:40] iteration 2454/ 11920 | consumed samples: 2512896 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053378E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:40.706358 | finish at 2025-09-10 11:42:20 + [2025-09-09 20:55:45] iteration 2455/ 11920 | consumed samples: 2513920 | elapsed time per iteration (ms): 5889.7 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057302E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:06.111442 | finish at 2025-09-10 12:24:52 + [2025-09-09 20:55:51] iteration 2456/ 11920 | consumed samples: 2514944 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.056972E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:38.126019 | finish at 2025-09-10 11:42:29 + [2025-09-09 20:55:57] iteration 2457/ 11920 | consumed samples: 2515968 | elapsed time per iteration (ms): 5929.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065974E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:06.563756 | finish at 2025-09-10 12:31:04 + [2025-09-09 20:56:03] iteration 2458/ 11920 | consumed samples: 2516992 | elapsed time per iteration (ms): 6313.8 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062322E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:35:40.833614 | finish at 2025-09-10 13:31:44 + [2025-09-09 20:56:09] iteration 2459/ 11920 | consumed samples: 2518016 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065431E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:24.703656 | finish at 2025-09-10 11:43:34 + [2025-09-09 20:56:15] iteration 2460/ 11920 | consumed samples: 2519040 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082253E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:02.521591 | finish at 2025-09-10 11:45:17 + [2025-09-09 20:56:20] iteration 2461/ 11920 | consumed samples: 2520064 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061344E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:20.605678 | finish at 2025-09-10 11:44:41 + [2025-09-09 20:56:26] iteration 2462/ 11920 | consumed samples: 2521088 | elapsed time per iteration (ms): 5924.4 | throughput per GPU (TFLOP/s/GPU): 76.2 
| MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082901E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:53.091065 | finish at 2025-09-10 12:30:19 + [2025-09-09 20:56:32] iteration 2463/ 11920 | consumed samples: 2522112 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077049E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:03.306253 | finish at 2025-09-10 11:43:35 + [2025-09-09 20:56:37] iteration 2464/ 11920 | consumed samples: 2523136 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064179E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:53.128773 | finish at 2025-09-10 11:43:31 + [2025-09-09 20:56:43] iteration 2465/ 11920 | consumed samples: 2524160 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080596E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:20.850301 | finish at 2025-09-10 11:42:04 + [2025-09-09 20:56:49] iteration 2466/ 11920 | consumed samples: 2525184 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050406E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:00.871199 | finish at 2025-09-10 
11:42:50 + [2025-09-09 20:56:54] iteration 2467/ 11920 | consumed samples: 2526208 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068099E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:47:31.245204 | finish at 2025-09-10 11:44:26 + [2025-09-09 20:57:00] iteration 2468/ 11920 | consumed samples: 2527232 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051773E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:54.824181 | finish at 2025-09-10 11:43:55 + [2025-09-09 20:57:06] iteration 2469/ 11920 | consumed samples: 2528256 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049055E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:00.175188 | finish at 2025-09-10 11:45:06 + [2025-09-09 20:57:11] iteration 2470/ 11920 | consumed samples: 2529280 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061795E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:13.573744 | finish at 2025-09-10 11:43:25 + [2025-09-09 20:57:17] iteration 2471/ 11920 | consumed samples: 2530304 | elapsed time per iteration (ms): 5845.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033305E+00 | loss 
scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:36.271857 | finish at 2025-09-10 12:17:53 + [2025-09-09 20:57:23] iteration 2472/ 11920 | consumed samples: 2531328 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053434E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:28.765665 | finish at 2025-09-10 11:42:51 + [2025-09-09 20:57:28] iteration 2473/ 11920 | consumed samples: 2532352 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052944E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:22.439653 | finish at 2025-09-10 11:42:51 + [2025-09-09 20:57:34] iteration 2474/ 11920 | consumed samples: 2533376 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046226E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:16.014698 | finish at 2025-09-10 11:42:50 + [2025-09-09 20:57:40] iteration 2475/ 11920 | consumed samples: 2534400 | elapsed time per iteration (ms): 6314.3 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051094E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:33:58.093430 | finish at 2025-09-10 13:31:38 + [2025-09-09 20:57:46] iteration 2476/ 11920 | consumed samples: 2535424 | elapsed time 
per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055431E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:20.439763 | finish at 2025-09-10 11:43:06 + [2025-09-09 20:57:52] iteration 2477/ 11920 | consumed samples: 2536448 | elapsed time per iteration (ms): 5847.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064919E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:18.263007 | finish at 2025-09-10 12:18:10 + [2025-09-09 20:57:58] iteration 2478/ 11920 | consumed samples: 2537472 | elapsed time per iteration (ms): 6040.6 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046758E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:50:34.933820 | finish at 2025-09-10 12:48:33 + [2025-09-09 20:58:03] iteration 2479/ 11920 | consumed samples: 2538496 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060330E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:27.789709 | finish at 2025-09-10 11:43:31 + [2025-09-09 20:58:09] iteration 2480/ 11920 | consumed samples: 2539520 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038842E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:45:48.281403 | finish at 2025-09-10 11:43:57 + [2025-09-09 20:58:15] iteration 2481/ 11920 | consumed samples: 2540544 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.065096E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:57.990402 | finish at 2025-09-10 11:42:13 + [2025-09-09 20:58:20] iteration 2482/ 11920 | consumed samples: 2541568 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067348E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:59.850210 | finish at 2025-09-10 11:43:20 + [2025-09-09 20:58:26] iteration 2483/ 11920 | consumed samples: 2542592 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049031E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:47.111922 | finish at 2025-09-10 11:43:13 + [2025-09-09 20:58:31] iteration 2484/ 11920 | consumed samples: 2543616 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050171E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:40.939847 | finish at 2025-09-10 11:42:12 + [2025-09-09 20:58:37] iteration 2485/ 11920 | consumed samples: 2544640 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.057770E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:00.730959 | finish at 2025-09-10 11:42:38 + [2025-09-09 20:58:43] iteration 2486/ 11920 | consumed samples: 2545664 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042912E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:14.041121 | finish at 2025-09-10 11:42:57 + [2025-09-09 20:58:48] iteration 2487/ 11920 | consumed samples: 2546688 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.054183E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:13.862250 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:58:54] iteration 2488/ 11920 | consumed samples: 2547712 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036378E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:12.560091 | finish at 2025-09-10 11:43:07 + [2025-09-09 20:59:00] iteration 2489/ 11920 | consumed samples: 2548736 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057271E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:23.563539 | finish at 2025-09-10 11:42:23 + [2025-09-09 
20:59:05] iteration 2490/ 11920 | consumed samples: 2549760 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044487E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:59.264708 | finish at 2025-09-10 11:44:04 + [2025-09-09 20:59:11] iteration 2491/ 11920 | consumed samples: 2550784 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.043846E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:44.214074 | finish at 2025-09-10 11:42:55 + [2025-09-09 20:59:16] iteration 2492/ 11920 | consumed samples: 2551808 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042688E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:59.243431 | finish at 2025-09-10 11:43:16 + [2025-09-09 20:59:22] iteration 2493/ 11920 | consumed samples: 2552832 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047289E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:38.574717 | finish at 2025-09-10 11:43:01 + [2025-09-09 20:59:28] iteration 2494/ 11920 | consumed samples: 2553856 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.043526E+00 | loss scale: 1.0 | grad norm: 
0.120 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:02.692499 | finish at 2025-09-10 11:42:30 + [2025-09-09 20:59:34] iteration 2495/ 11920 | consumed samples: 2554880 | elapsed time per iteration (ms): 5990.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041924E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:41:04.639598 | finish at 2025-09-10 12:40:38 + [2025-09-09 20:59:40] iteration 2496/ 11920 | consumed samples: 2555904 | elapsed time per iteration (ms): 5953.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042469E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:08.572655 | finish at 2025-09-10 12:34:48 + [2025-09-09 20:59:45] iteration 2497/ 11920 | consumed samples: 2556928 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067147E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:05.010963 | finish at 2025-09-10 11:41:50 + [2025-09-09 20:59:51] iteration 2498/ 11920 | consumed samples: 2557952 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045727E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:11.132526 | finish at 2025-09-10 11:43:02 + [2025-09-09 20:59:57] iteration 2499/ 11920 | consumed samples: 2558976 | elapsed time per iteration (ms): 
5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048680E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:41.694849 | finish at 2025-09-10 11:44:38 + [2025-09-09 21:00:02] iteration 2500/ 11920 | consumed samples: 2560000 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.043861E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:25.616155 | finish at 2025-09-10 11:42:28 + [2025-09-09 21:00:08] iteration 2501/ 11920 | consumed samples: 2561024 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038854E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:52.258731 | finish at 2025-09-10 11:44:00 + [2025-09-09 21:00:13] iteration 2502/ 11920 | consumed samples: 2562048 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055803E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:39.032072 | finish at 2025-09-10 11:42:52 + [2025-09-09 21:00:19] iteration 2503/ 11920 | consumed samples: 2563072 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047082E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 14:43:22.261941 | finish at 2025-09-10 11:43:41 + [2025-09-09 21:00:25] iteration 2504/ 11920 | consumed samples: 2564096 | elapsed time per iteration (ms): 5925.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051159E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:55.589275 | finish at 2025-09-10 12:30:21 + [2025-09-09 21:00:31] iteration 2505/ 11920 | consumed samples: 2565120 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051981E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:35.600519 | finish at 2025-09-10 11:44:06 + [2025-09-09 21:00:36] iteration 2506/ 11920 | consumed samples: 2566144 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052645E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:57.617721 | finish at 2025-09-10 11:43:34 + [2025-09-09 21:00:42] iteration 2507/ 11920 | consumed samples: 2567168 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055716E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:34.938495 | finish at 2025-09-10 11:43:17 + [2025-09-09 21:00:47] iteration 2508/ 11920 | consumed samples: 2568192 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.052799E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:45.646859 | finish at 2025-09-10 11:42:33 + [2025-09-09 21:00:53] iteration 2509/ 11920 | consumed samples: 2569216 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052005E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:14.440836 | finish at 2025-09-10 11:44:08 + [2025-09-09 21:00:59] iteration 2510/ 11920 | consumed samples: 2570240 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061541E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:20.455825 | finish at 2025-09-10 11:44:19 + [2025-09-09 21:01:04] iteration 2511/ 11920 | consumed samples: 2571264 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.054263E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:51.168079 | finish at 2025-09-10 11:43:56 + [2025-09-09 21:01:10] iteration 2512/ 11920 | consumed samples: 2572288 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049503E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:44:14.106995 | finish at 2025-09-10 11:45:24 + [2025-09-09 21:01:16] iteration 
2513/ 11920 | consumed samples: 2573312 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045161E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:02.553986 | finish at 2025-09-10 11:44:18 + [2025-09-09 21:01:21] iteration 2514/ 11920 | consumed samples: 2574336 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042768E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:45.377182 | finish at 2025-09-10 11:43:07 + [2025-09-09 21:01:27] iteration 2515/ 11920 | consumed samples: 2575360 | elapsed time per iteration (ms): 5969.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052379E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:35:39.280096 | finish at 2025-09-10 12:37:06 + [2025-09-09 21:01:33] iteration 2516/ 11920 | consumed samples: 2576384 | elapsed time per iteration (ms): 6060.7 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047679E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:49:54.353637 | finish at 2025-09-10 12:51:28 + [2025-09-09 21:01:39] iteration 2517/ 11920 | consumed samples: 2577408 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028628E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:01.372390 | finish at 2025-09-10 11:42:40 + [2025-09-09 21:01:45] iteration 2518/ 11920 | consumed samples: 2578432 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032090E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:54.441533 | finish at 2025-09-10 11:42:39 + [2025-09-09 21:01:50] iteration 2519/ 11920 | consumed samples: 2579456 | elapsed time per iteration (ms): 5949.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048080E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:10.817704 | finish at 2025-09-10 12:34:01 + [2025-09-09 21:01:56] iteration 2520/ 11920 | consumed samples: 2580480 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038834E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:51.411371 | finish at 2025-09-10 11:44:48 + [2025-09-09 21:02:02] iteration 2521/ 11920 | consumed samples: 2581504 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060786E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:07.934104 | finish at 2025-09-10 11:43:10 + [2025-09-09 21:02:07] iteration 2522/ 11920 | consumed samples: 2582528 | elapsed time per iteration (ms): 5623.7 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037351E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:51.289702 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:02:13] iteration 2523/ 11920 | consumed samples: 2583552 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059272E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:23.335768 | finish at 2025-09-10 11:42:36 + [2025-09-09 21:02:19] iteration 2524/ 11920 | consumed samples: 2584576 | elapsed time per iteration (ms): 5873.2 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060732E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:44.725525 | finish at 2025-09-10 12:22:04 + [2025-09-09 21:02:24] iteration 2525/ 11920 | consumed samples: 2585600 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028586E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:29.639983 | finish at 2025-09-10 11:43:54 + [2025-09-09 21:02:30] iteration 2526/ 11920 | consumed samples: 2586624 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038800E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:52.740248 
| finish at 2025-09-10 11:42:23 + [2025-09-09 21:02:36] iteration 2527/ 11920 | consumed samples: 2587648 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049259E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:24.100709 | finish at 2025-09-10 11:43:00 + [2025-09-09 21:02:41] iteration 2528/ 11920 | consumed samples: 2588672 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045591E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:33.981251 | finish at 2025-09-10 11:42:15 + [2025-09-09 21:02:47] iteration 2529/ 11920 | consumed samples: 2589696 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047951E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:04.549601 | finish at 2025-09-10 11:44:52 + [2025-09-09 21:02:53] iteration 2530/ 11920 | consumed samples: 2590720 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039957E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:49.048512 | finish at 2025-09-10 11:45:42 + [2025-09-09 21:02:58] iteration 2531/ 11920 | consumed samples: 2591744 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.039871E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:03.856398 | finish at 2025-09-10 11:44:02 + [2025-09-09 21:03:04] iteration 2532/ 11920 | consumed samples: 2592768 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045386E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:55.658132 | finish at 2025-09-10 11:42:00 + [2025-09-09 21:03:09] iteration 2533/ 11920 | consumed samples: 2593792 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044125E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:48.922976 | finish at 2025-09-10 11:43:58 + [2025-09-09 21:03:15] iteration 2534/ 11920 | consumed samples: 2594816 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040132E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:51.308945 | finish at 2025-09-10 11:43:06 + [2025-09-09 21:03:21] iteration 2535/ 11920 | consumed samples: 2595840 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055140E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:22.968906 | finish at 2025-09-10 11:43:44 + [2025-09-09 21:03:26] iteration 2536/ 11920 | consumed samples: 
2596864 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039561E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:55.416515 | finish at 2025-09-10 11:42:22 + [2025-09-09 21:03:32] iteration 2537/ 11920 | consumed samples: 2597888 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041950E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:10.630739 | finish at 2025-09-10 11:42:43 + [2025-09-09 21:03:38] iteration 2538/ 11920 | consumed samples: 2598912 | elapsed time per iteration (ms): 5861.2 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052947E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:16:29.524284 | finish at 2025-09-10 12:20:07 + [2025-09-09 21:03:43] iteration 2539/ 11920 | consumed samples: 2599936 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050819E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:02.077502 | finish at 2025-09-10 11:42:46 + [2025-09-09 21:03:49] iteration 2540/ 11920 | consumed samples: 2600960 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029744E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:26.110144 | finish at 2025-09-10 11:46:15 + [2025-09-09 21:03:55] iteration 2541/ 11920 | consumed samples: 2601984 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038583E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:24.887046 | finish at 2025-09-10 11:43:20 + [2025-09-09 21:04:00] iteration 2542/ 11920 | consumed samples: 2603008 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040301E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:29.921820 | finish at 2025-09-10 11:42:30 + [2025-09-09 21:04:06] iteration 2543/ 11920 | consumed samples: 2604032 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025094E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:42.461018 | finish at 2025-09-10 11:41:48 + [2025-09-09 21:04:12] iteration 2544/ 11920 | consumed samples: 2605056 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044234E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:35.685417 | finish at 2025-09-10 11:42:47 + [2025-09-09 21:04:17] iteration 2545/ 11920 | consumed samples: 2606080 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 
| MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038480E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:48.526769 | finish at 2025-09-10 11:42:06 + [2025-09-09 21:04:23] iteration 2546/ 11920 | consumed samples: 2607104 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035197E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:14.356574 | finish at 2025-09-10 11:42:37 + [2025-09-09 21:04:28] iteration 2547/ 11920 | consumed samples: 2608128 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038338E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:42.703253 | finish at 2025-09-10 11:42:11 + [2025-09-09 21:04:34] iteration 2548/ 11920 | consumed samples: 2609152 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069215E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:38.295467 | finish at 2025-09-10 11:43:12 + [2025-09-09 21:04:40] iteration 2549/ 11920 | consumed samples: 2610176 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050363E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:52.065659 | finish at 2025-09-10 
11:42:32 + [2025-09-09 21:04:45] iteration 2550/ 11920 | consumed samples: 2611200 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047071E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:56.299293 | finish at 2025-09-10 11:43:42 + [2025-09-09 21:04:51] iteration 2551/ 11920 | consumed samples: 2612224 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026586E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:03.376620 | finish at 2025-09-10 11:43:54 + [2025-09-09 21:04:57] iteration 2552/ 11920 | consumed samples: 2613248 | elapsed time per iteration (ms): 5649.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045477E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:42:05.660788 | finish at 2025-09-10 11:47:02 + [2025-09-09 21:05:02] iteration 2553/ 11920 | consumed samples: 2614272 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045866E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:26.842433 | finish at 2025-09-10 11:42:29 + [2025-09-09 21:05:08] iteration 2554/ 11920 | consumed samples: 2615296 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050846E+00 | loss 
scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:38.862898 | finish at 2025-09-10 11:42:47 + [2025-09-09 21:05:13] iteration 2555/ 11920 | consumed samples: 2616320 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039705E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:00.878497 | finish at 2025-09-10 11:42:14 + [2025-09-09 21:05:19] iteration 2556/ 11920 | consumed samples: 2617344 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044044E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:08.047489 | finish at 2025-09-10 11:43:27 + [2025-09-09 21:05:25] iteration 2557/ 11920 | consumed samples: 2618368 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048573E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:09.872177 | finish at 2025-09-10 11:42:35 + [2025-09-09 21:05:30] iteration 2558/ 11920 | consumed samples: 2619392 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035026E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:26.310323 | finish at 2025-09-10 11:41:57 + [2025-09-09 21:05:36] iteration 2559/ 11920 | consumed samples: 2620416 | elapsed time 
per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038831E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:53.952152 | finish at 2025-09-10 11:42:30 + [2025-09-09 21:05:42] iteration 2560/ 11920 | consumed samples: 2621440 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039562E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:19.930058 | finish at 2025-09-10 11:42:02 + [2025-09-09 21:05:47] iteration 2561/ 11920 | consumed samples: 2622464 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041160E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:14.325932 | finish at 2025-09-10 11:42:02 + [2025-09-09 21:05:53] iteration 2562/ 11920 | consumed samples: 2623488 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030941E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:06.273571 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:05:58] iteration 2563/ 11920 | consumed samples: 2624512 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042726E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:37:38.474519 | finish at 2025-09-10 11:43:37 + [2025-09-09 21:06:05] iteration 2564/ 11920 | consumed samples: 2625536 | elapsed time per iteration (ms): 6240.1 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039850E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:13:02.653560 | finish at 2025-09-10 13:19:07 + [2025-09-09 21:06:10] iteration 2565/ 11920 | consumed samples: 2626560 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046878E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:27.981753 | finish at 2025-09-10 11:42:38 + [2025-09-09 21:06:16] iteration 2566/ 11920 | consumed samples: 2627584 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037673E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:38.348023 | finish at 2025-09-10 11:41:54 + [2025-09-09 21:06:22] iteration 2567/ 11920 | consumed samples: 2628608 | elapsed time per iteration (ms): 5938.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044528E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:25:38.813859 | finish at 2025-09-10 12:32:01 + [2025-09-09 21:06:28] iteration 2568/ 11920 | consumed samples: 2629632 | elapsed time per iteration (ms): 5875.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.037040E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:44.418083 | finish at 2025-09-10 12:22:12 + [2025-09-09 21:06:33] iteration 2569/ 11920 | consumed samples: 2630656 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.058644E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:42.737998 | finish at 2025-09-10 11:42:16 + [2025-09-09 21:06:39] iteration 2570/ 11920 | consumed samples: 2631680 | elapsed time per iteration (ms): 5970.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035145E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:30:25.351954 | finish at 2025-09-10 12:37:05 + [2025-09-09 21:06:45] iteration 2571/ 11920 | consumed samples: 2632704 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048710E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:35.206897 | finish at 2025-09-10 11:42:20 + [2025-09-09 21:06:51] iteration 2572/ 11920 | consumed samples: 2633728 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046207E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:12.381709 | finish at 2025-09-10 11:42:03 + [2025-09-09 
21:06:56] iteration 2573/ 11920 | consumed samples: 2634752 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045522E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:58.191361 | finish at 2025-09-10 11:43:54 + [2025-09-09 21:07:02] iteration 2574/ 11920 | consumed samples: 2635776 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.043241E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:32.525069 | finish at 2025-09-10 11:42:34 + [2025-09-09 21:07:08] iteration 2575/ 11920 | consumed samples: 2636800 | elapsed time per iteration (ms): 5922.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023630E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:22:26.184844 | finish at 2025-09-10 12:29:34 + [2025-09-09 21:07:13] iteration 2576/ 11920 | consumed samples: 2637824 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030077E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:22.484131 | finish at 2025-09-10 11:42:36 + [2025-09-09 21:07:19] iteration 2577/ 11920 | consumed samples: 2638848 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017716E+00 | loss scale: 1.0 | grad norm: 
0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:44.326545 | finish at 2025-09-10 11:43:03 + [2025-09-09 21:07:25] iteration 2578/ 11920 | consumed samples: 2639872 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037469E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:57.624410 | finish at 2025-09-10 11:42:22 + [2025-09-09 21:07:30] iteration 2579/ 11920 | consumed samples: 2640896 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035670E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:46.149921 | finish at 2025-09-10 11:42:16 + [2025-09-09 21:07:36] iteration 2580/ 11920 | consumed samples: 2641920 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032961E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:53.010173 | finish at 2025-09-10 11:42:29 + [2025-09-09 21:07:41] iteration 2581/ 11920 | consumed samples: 2642944 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019592E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:08.457991 | finish at 2025-09-10 11:41:50 + [2025-09-09 21:07:47] iteration 2582/ 11920 | consumed samples: 2643968 | elapsed time per iteration (ms): 
5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035329E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:56.279045 | finish at 2025-09-10 11:43:43 + [2025-09-09 21:07:53] iteration 2583/ 11920 | consumed samples: 2644992 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033655E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:52.213104 | finish at 2025-09-10 11:42:45 + [2025-09-09 21:07:59] iteration 2584/ 11920 | consumed samples: 2646016 | elapsed time per iteration (ms): 5906.4 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027530E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:19:02.486183 | finish at 2025-09-10 12:27:01 + [2025-09-09 21:08:04] iteration 2585/ 11920 | consumed samples: 2647040 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044940E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:41.679168 | finish at 2025-09-10 11:42:46 + [2025-09-09 21:08:10] iteration 2586/ 11920 | consumed samples: 2648064 | elapsed time per iteration (ms): 5911.8 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027919E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:19:40.780396 | finish at 2025-09-10 12:27:51 + [2025-09-09 21:08:16] iteration 2587/ 11920 | consumed samples: 2649088 | elapsed time per iteration (ms): 5838.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034177E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:13.453211 | finish at 2025-09-10 12:16:29 + [2025-09-09 21:08:22] iteration 2588/ 11920 | consumed samples: 2650112 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044106E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:23.555464 | finish at 2025-09-10 11:43:45 + [2025-09-09 21:08:27] iteration 2589/ 11920 | consumed samples: 2651136 | elapsed time per iteration (ms): 5836.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045982E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:43.088282 | finish at 2025-09-10 12:16:11 + [2025-09-09 21:08:33] iteration 2590/ 11920 | consumed samples: 2652160 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039352E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:20.438054 | finish at 2025-09-10 11:43:54 + [2025-09-09 21:08:39] iteration 2591/ 11920 | consumed samples: 2653184 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.033717E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:13.146959 | finish at 2025-09-10 11:41:52 + [2025-09-09 21:08:45] iteration 2592/ 11920 | consumed samples: 2654208 | elapsed time per iteration (ms): 5837.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031200E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:31.555939 | finish at 2025-09-10 12:16:16 + [2025-09-09 21:08:50] iteration 2593/ 11920 | consumed samples: 2655232 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041536E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:11.355060 | finish at 2025-09-10 11:43:02 + [2025-09-09 21:08:56] iteration 2594/ 11920 | consumed samples: 2656256 | elapsed time per iteration (ms): 5972.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037520E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:28:22.546408 | finish at 2025-09-10 12:37:19 + [2025-09-09 21:09:02] iteration 2595/ 11920 | consumed samples: 2657280 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037227E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:53.490386 | finish at 2025-09-10 11:43:55 + [2025-09-09 21:09:08] iteration 
2596/ 11920 | consumed samples: 2658304 | elapsed time per iteration (ms): 6199.0 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046479E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:03:19.612724 | finish at 2025-09-10 13:12:28 + [2025-09-09 21:09:14] iteration 2597/ 11920 | consumed samples: 2659328 | elapsed time per iteration (ms): 6178.1 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029540E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:59:58.358411 | finish at 2025-09-10 13:09:13 + [2025-09-09 21:09:20] iteration 2598/ 11920 | consumed samples: 2660352 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039385E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:30.296695 | finish at 2025-09-10 11:44:50 + [2025-09-09 21:09:25] iteration 2599/ 11920 | consumed samples: 2661376 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047501E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:29.595245 | finish at 2025-09-10 11:43:55 + [2025-09-09 21:09:31] iteration 2600/ 11920 | consumed samples: 2662400 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039726E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:15.144482 | finish at 2025-09-10 11:43:46 + [2025-09-09 21:09:37] iteration 2601/ 11920 | consumed samples: 2663424 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051368E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:27.619013 | finish at 2025-09-10 11:42:04 + [2025-09-09 21:09:42] iteration 2602/ 11920 | consumed samples: 2664448 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045235E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:28.593703 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:09:48] iteration 2603/ 11920 | consumed samples: 2665472 | elapsed time per iteration (ms): 5933.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039860E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:21:17.537462 | finish at 2025-09-10 12:31:06 + [2025-09-09 21:09:54] iteration 2604/ 11920 | consumed samples: 2666496 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031817E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:24.279107 | finish at 2025-09-10 11:43:18 + [2025-09-09 21:09:59] iteration 2605/ 11920 | consumed samples: 2667520 | elapsed time per iteration (ms): 5625.3 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022568E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:19.624436 | finish at 2025-09-10 11:43:19 + [2025-09-09 21:10:05] iteration 2606/ 11920 | consumed samples: 2668544 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026194E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:38.452725 | finish at 2025-09-10 11:43:44 + [2025-09-09 21:10:11] iteration 2607/ 11920 | consumed samples: 2669568 | elapsed time per iteration (ms): 5972.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037060E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:27:04.491158 | finish at 2025-09-10 12:37:16 + [2025-09-09 21:10:17] iteration 2608/ 11920 | consumed samples: 2670592 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040028E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:35.089874 | finish at 2025-09-10 11:42:52 + [2025-09-09 21:10:22] iteration 2609/ 11920 | consumed samples: 2671616 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030404E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:16.001416 
| finish at 2025-09-10 11:43:38 + [2025-09-09 21:10:28] iteration 2610/ 11920 | consumed samples: 2672640 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036129E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:23.137059 | finish at 2025-09-10 11:41:51 + [2025-09-09 21:10:34] iteration 2611/ 11920 | consumed samples: 2673664 | elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045128E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:06.301993 | finish at 2025-09-10 11:41:40 + [2025-09-09 21:10:39] iteration 2612/ 11920 | consumed samples: 2674688 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041305E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:47.977358 | finish at 2025-09-10 11:44:27 + [2025-09-09 21:10:45] iteration 2613/ 11920 | consumed samples: 2675712 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014670E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:13.957609 | finish at 2025-09-10 11:43:59 + [2025-09-09 21:10:51] iteration 2614/ 11920 | consumed samples: 2676736 | elapsed time per iteration (ms): 6007.4 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.046256E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:31:44.445395 | finish at 2025-09-10 12:42:35 + [2025-09-09 21:10:56] iteration 2615/ 11920 | consumed samples: 2677760 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.043416E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:16.095996 | finish at 2025-09-10 11:44:13 + [2025-09-09 21:11:02] iteration 2616/ 11920 | consumed samples: 2678784 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038249E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:19.525223 | finish at 2025-09-10 11:43:22 + [2025-09-09 21:11:08] iteration 2617/ 11920 | consumed samples: 2679808 | elapsed time per iteration (ms): 6088.3 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039649E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:43:59.099923 | finish at 2025-09-10 12:55:07 + [2025-09-09 21:11:14] iteration 2618/ 11920 | consumed samples: 2680832 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042772E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:13.310805 | finish at 2025-09-10 11:43:27 + [2025-09-09 21:11:20] iteration 2619/ 11920 | consumed samples: 
2681856 | elapsed time per iteration (ms): 6021.0 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044277E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:33:21.632820 | finish at 2025-09-10 12:44:41 + [2025-09-09 21:11:25] iteration 2620/ 11920 | consumed samples: 2682880 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052845E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:49.074225 | finish at 2025-09-10 11:44:15 + [2025-09-09 21:11:31] iteration 2621/ 11920 | consumed samples: 2683904 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023249E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:36.840612 | finish at 2025-09-10 11:43:08 + [2025-09-09 21:11:37] iteration 2622/ 11920 | consumed samples: 2684928 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026192E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:27.955755 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:11:42] iteration 2623/ 11920 | consumed samples: 2685952 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025989E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:10.436350 | finish at 2025-09-10 11:43:53 + [2025-09-09 21:11:48] iteration 2624/ 11920 | consumed samples: 2686976 | elapsed time per iteration (ms): 5940.0 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033574E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:18.052143 | finish at 2025-09-10 12:32:06 + [2025-09-09 21:11:54] iteration 2625/ 11920 | consumed samples: 2688000 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035071E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:55.005944 | finish at 2025-09-10 11:43:49 + [2025-09-09 21:12:00] iteration 2626/ 11920 | consumed samples: 2689024 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040059E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:13.734406 | finish at 2025-09-10 11:44:13 + [2025-09-09 21:12:06] iteration 2627/ 11920 | consumed samples: 2690048 | elapsed time per iteration (ms): 6004.2 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031230E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:29:56.591736 | finish at 2025-09-10 12:42:02 + [2025-09-09 21:12:11] iteration 2628/ 11920 | consumed samples: 2691072 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027360E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:30:47.889408 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:12:17] iteration 2629/ 11920 | consumed samples: 2692096 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023984E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:00.359839 | finish at 2025-09-10 11:43:17 + [2025-09-09 21:12:22] iteration 2630/ 11920 | consumed samples: 2693120 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042552E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:28.501284 | finish at 2025-09-10 11:43:51 + [2025-09-09 21:12:28] iteration 2631/ 11920 | consumed samples: 2694144 | elapsed time per iteration (ms): 5932.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020828E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:28.721094 | finish at 2025-09-10 12:30:57 + [2025-09-09 21:12:34] iteration 2632/ 11920 | consumed samples: 2695168 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030804E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:30:49.295998 | finish at 2025-09-10 
11:43:23 + [2025-09-09 21:12:40] iteration 2633/ 11920 | consumed samples: 2696192 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031231E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:17.904178 | finish at 2025-09-10 11:43:57 + [2025-09-09 21:12:45] iteration 2634/ 11920 | consumed samples: 2697216 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026107E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:30.329047 | finish at 2025-09-10 11:42:16 + [2025-09-09 21:12:51] iteration 2635/ 11920 | consumed samples: 2698240 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020713E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:32.620486 | finish at 2025-09-10 11:42:23 + [2025-09-09 21:12:56] iteration 2636/ 11920 | consumed samples: 2699264 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030369E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:36.189650 | finish at 2025-09-10 11:45:33 + [2025-09-09 21:13:02] iteration 2637/ 11920 | consumed samples: 2700288 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035232E+00 | loss 
scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:41.711048 | finish at 2025-09-10 11:42:44 + [2025-09-09 21:13:08] iteration 2638/ 11920 | consumed samples: 2701312 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037616E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:30:40.021229 | finish at 2025-09-10 11:43:48 + [2025-09-09 21:13:13] iteration 2639/ 11920 | consumed samples: 2702336 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020797E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:15.820134 | finish at 2025-09-10 11:42:29 + [2025-09-09 21:13:19] iteration 2640/ 11920 | consumed samples: 2703360 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037096E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:44.729156 | finish at 2025-09-10 11:43:04 + [2025-09-09 21:13:25] iteration 2641/ 11920 | consumed samples: 2704384 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026634E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:45.428515 | finish at 2025-09-10 11:43:10 + [2025-09-09 21:13:30] iteration 2642/ 11920 | consumed samples: 2705408 | elapsed time 
per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026567E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:58.998838 | finish at 2025-09-10 11:42:29 + [2025-09-09 21:13:36] iteration 2643/ 11920 | consumed samples: 2706432 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041635E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:39.719067 | finish at 2025-09-10 11:42:16 + [2025-09-09 21:13:41] iteration 2644/ 11920 | consumed samples: 2707456 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032710E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:25.208176 | finish at 2025-09-10 11:42:07 + [2025-09-09 21:13:47] iteration 2645/ 11920 | consumed samples: 2708480 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024196E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:01.316599 | finish at 2025-09-10 11:42:48 + [2025-09-09 21:13:53] iteration 2646/ 11920 | consumed samples: 2709504 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027166E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:29:12.222820 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:13:58] iteration 2647/ 11920 | consumed samples: 2710528 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030721E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:30:04.006406 | finish at 2025-09-10 11:44:02 + [2025-09-09 21:14:04] iteration 2648/ 11920 | consumed samples: 2711552 | elapsed time per iteration (ms): 5830.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029639E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:57.525576 | finish at 2025-09-10 12:15:02 + [2025-09-09 21:14:10] iteration 2649/ 11920 | consumed samples: 2712576 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028316E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:13.306535 | finish at 2025-09-10 11:42:23 + [2025-09-09 21:14:15] iteration 2650/ 11920 | consumed samples: 2713600 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023268E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:33.431296 | finish at 2025-09-10 11:42:49 + [2025-09-09 21:14:21] iteration 2651/ 11920 | consumed samples: 2714624 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.039363E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:46.117438 | finish at 2025-09-10 11:44:07 + [2025-09-09 21:14:27] iteration 2652/ 11920 | consumed samples: 2715648 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041975E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:30:36.670161 | finish at 2025-09-10 11:45:03 + [2025-09-09 21:14:32] iteration 2653/ 11920 | consumed samples: 2716672 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037200E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:36.280807 | finish at 2025-09-10 11:43:09 + [2025-09-09 21:14:38] iteration 2654/ 11920 | consumed samples: 2717696 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024667E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:27.104578 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:14:44] iteration 2655/ 11920 | consumed samples: 2718720 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026737E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:35.890625 | finish at 2025-09-10 11:42:19 + [2025-09-09 
21:14:49] iteration 2656/ 11920 | consumed samples: 2719744 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035382E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:23.040756 | finish at 2025-09-10 11:42:12 + [2025-09-09 21:14:55] iteration 2657/ 11920 | consumed samples: 2720768 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035231E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:29.522947 | finish at 2025-09-10 11:43:24 + [2025-09-09 21:15:00] iteration 2658/ 11920 | consumed samples: 2721792 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034132E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:36.031526 | finish at 2025-09-10 11:44:36 + [2025-09-09 21:15:06] iteration 2659/ 11920 | consumed samples: 2722816 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041024E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:32.341134 | finish at 2025-09-10 11:42:38 + [2025-09-09 21:15:12] iteration 2660/ 11920 | consumed samples: 2723840 | elapsed time per iteration (ms): 5865.2 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022915E+00 | loss scale: 1.0 | grad norm: 
0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:11.648631 | finish at 2025-09-10 12:20:24 + [2025-09-09 21:15:18] iteration 2661/ 11920 | consumed samples: 2724864 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022657E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:01.208330 | finish at 2025-09-10 11:43:19 + [2025-09-09 21:15:23] iteration 2662/ 11920 | consumed samples: 2725888 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039228E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:39.400101 | finish at 2025-09-10 11:44:03 + [2025-09-09 21:15:29] iteration 2663/ 11920 | consumed samples: 2726912 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030069E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:21.542824 | finish at 2025-09-10 11:42:50 + [2025-09-09 21:15:34] iteration 2664/ 11920 | consumed samples: 2727936 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042936E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:09.190968 | finish at 2025-09-10 11:43:44 + [2025-09-09 21:15:40] iteration 2665/ 11920 | consumed samples: 2728960 | elapsed time per iteration (ms): 
5647.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024434E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:31:03.568221 | finish at 2025-09-10 11:46:44 + [2025-09-09 21:15:46] iteration 2666/ 11920 | consumed samples: 2729984 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014595E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:32.757154 | finish at 2025-09-10 11:43:18 + [2025-09-09 21:15:51] iteration 2667/ 11920 | consumed samples: 2731008 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022299E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:37.097156 | finish at 2025-09-10 11:43:28 + [2025-09-09 21:15:57] iteration 2668/ 11920 | consumed samples: 2732032 | elapsed time per iteration (ms): 5966.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037976E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:03.818673 | finish at 2025-09-10 12:36:01 + [2025-09-09 21:16:03] iteration 2669/ 11920 | consumed samples: 2733056 | elapsed time per iteration (ms): 5823.9 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027230E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 14:57:57.275915 | finish at 2025-09-10 12:14:00 + [2025-09-09 21:16:09] iteration 2670/ 11920 | consumed samples: 2734080 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044509E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:26:54.142931 | finish at 2025-09-10 11:43:03 + [2025-09-09 21:16:14] iteration 2671/ 11920 | consumed samples: 2735104 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025342E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:12.438862 | finish at 2025-09-10 11:43:27 + [2025-09-09 21:16:20] iteration 2672/ 11920 | consumed samples: 2736128 | elapsed time per iteration (ms): 5958.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032568E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:18:22.421188 | finish at 2025-09-10 12:34:43 + [2025-09-09 21:16:26] iteration 2673/ 11920 | consumed samples: 2737152 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052609E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:26:45.477006 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:16:32] iteration 2674/ 11920 | consumed samples: 2738176 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.041354E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:28.145157 | finish at 2025-09-10 11:44:00 + [2025-09-09 21:16:37] iteration 2675/ 11920 | consumed samples: 2739200 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017567E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:57.926090 | finish at 2025-09-10 11:42:35 + [2025-09-09 21:16:43] iteration 2676/ 11920 | consumed samples: 2740224 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027701E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:26:04.983049 | finish at 2025-09-10 11:42:48 + [2025-09-09 21:16:48] iteration 2677/ 11920 | consumed samples: 2741248 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042196E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:26:34.759647 | finish at 2025-09-10 11:43:23 + [2025-09-09 21:16:54] iteration 2678/ 11920 | consumed samples: 2742272 | elapsed time per iteration (ms): 5938.3 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040613E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:41.710333 | finish at 2025-09-10 12:31:36 + [2025-09-09 21:17:00] iteration 
2679/ 11920 | consumed samples: 2743296 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040462E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:20.964753 | finish at 2025-09-10 11:44:21 + [2025-09-09 21:17:06] iteration 2680/ 11920 | consumed samples: 2744320 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046740E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:26.904373 | finish at 2025-09-10 11:42:32 + [2025-09-09 21:17:11] iteration 2681/ 11920 | consumed samples: 2745344 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021299E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:25.756159 | finish at 2025-09-10 11:42:37 + [2025-09-09 21:17:17] iteration 2682/ 11920 | consumed samples: 2746368 | elapsed time per iteration (ms): 5938.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030139E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:17.155445 | finish at 2025-09-10 12:31:34 + [2025-09-09 21:17:23] iteration 2683/ 11920 | consumed samples: 2747392 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021890E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:40.491405 | finish at 2025-09-10 11:43:03 + [2025-09-09 21:17:28] iteration 2684/ 11920 | consumed samples: 2748416 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026214E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:50.235287 | finish at 2025-09-10 11:42:19 + [2025-09-09 21:17:34] iteration 2685/ 11920 | consumed samples: 2749440 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030726E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:21.127203 | finish at 2025-09-10 11:42:55 + [2025-09-09 21:17:40] iteration 2686/ 11920 | consumed samples: 2750464 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017461E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:05.476898 | finish at 2025-09-10 11:42:45 + [2025-09-09 21:17:46] iteration 2687/ 11920 | consumed samples: 2751488 | elapsed time per iteration (ms): 5962.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044177E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:34.184831 | finish at 2025-09-10 12:35:20 + [2025-09-09 21:17:51] iteration 2688/ 11920 | consumed samples: 2752512 | elapsed time per iteration (ms): 5623.3 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020183E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:14.154423 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:17:57] iteration 2689/ 11920 | consumed samples: 2753536 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020597E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:26:57.067858 | finish at 2025-09-10 11:44:54 + [2025-09-09 21:18:02] iteration 2690/ 11920 | consumed samples: 2754560 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017330E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:52.404375 | finish at 2025-09-10 11:42:55 + [2025-09-09 21:18:08] iteration 2691/ 11920 | consumed samples: 2755584 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034623E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:27.280394 | finish at 2025-09-10 11:42:35 + [2025-09-09 21:18:14] iteration 2692/ 11920 | consumed samples: 2756608 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007422E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:11.018349 
| finish at 2025-09-10 11:42:25 + [2025-09-09 21:18:19] iteration 2693/ 11920 | consumed samples: 2757632 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027307E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:58.623814 | finish at 2025-09-10 11:42:18 + [2025-09-09 21:18:25] iteration 2694/ 11920 | consumed samples: 2758656 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029797E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:07.734524 | finish at 2025-09-10 11:42:33 + [2025-09-09 21:18:31] iteration 2695/ 11920 | consumed samples: 2759680 | elapsed time per iteration (ms): 5908.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035108E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:08:20.912833 | finish at 2025-09-10 12:26:52 + [2025-09-09 21:18:36] iteration 2696/ 11920 | consumed samples: 2760704 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035986E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:06.312185 | finish at 2025-09-10 11:43:43 + [2025-09-09 21:18:42] iteration 2697/ 11920 | consumed samples: 2761728 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.032481E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:09.330957 | finish at 2025-09-10 11:42:51 + [2025-09-09 21:18:48] iteration 2698/ 11920 | consumed samples: 2762752 | elapsed time per iteration (ms): 5829.3 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022505E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:58.254657 | finish at 2025-09-10 12:14:46 + [2025-09-09 21:18:54] iteration 2699/ 11920 | consumed samples: 2763776 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017056E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:37.353817 | finish at 2025-09-10 11:42:31 + [2025-09-09 21:18:59] iteration 2700/ 11920 | consumed samples: 2764800 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029156E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:36.113968 | finish at 2025-09-10 11:44:35 + [2025-09-09 21:19:05] iteration 2701/ 11920 | consumed samples: 2765824 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029216E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:58.570215 | finish at 2025-09-10 11:43:03 + [2025-09-09 21:19:10] iteration 2702/ 11920 | consumed samples: 
2766848 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036829E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:43.969423 | finish at 2025-09-10 11:42:54 + [2025-09-09 21:19:16] iteration 2703/ 11920 | consumed samples: 2767872 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042099E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:43.504924 | finish at 2025-09-10 11:43:00 + [2025-09-09 21:19:22] iteration 2704/ 11920 | consumed samples: 2768896 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022536E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:30.826172 | finish at 2025-09-10 11:44:53 + [2025-09-09 21:19:27] iteration 2705/ 11920 | consumed samples: 2769920 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038363E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:32.289099 | finish at 2025-09-10 11:44:00 + [2025-09-09 21:19:33] iteration 2706/ 11920 | consumed samples: 2770944 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029466E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:19.339392 | finish at 2025-09-10 11:42:52 + [2025-09-09 21:19:39] iteration 2707/ 11920 | consumed samples: 2771968 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050721E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:22:55.703674 | finish at 2025-09-10 11:42:34 + [2025-09-09 21:19:44] iteration 2708/ 11920 | consumed samples: 2772992 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038334E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:35.158731 | finish at 2025-09-10 11:43:19 + [2025-09-09 21:19:50] iteration 2709/ 11920 | consumed samples: 2774016 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024416E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:40.013649 | finish at 2025-09-10 11:43:30 + [2025-09-09 21:19:55] iteration 2710/ 11920 | consumed samples: 2775040 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030004E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:33.086829 | finish at 2025-09-10 11:44:29 + [2025-09-09 21:20:01] iteration 2711/ 11920 | consumed samples: 2776064 | elapsed time per iteration (ms): 5950.8 | throughput per GPU (TFLOP/s/GPU): 75.9 
| MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014088E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:13:21.230636 | finish at 2025-09-10 12:33:23 + [2025-09-09 21:20:07] iteration 2712/ 11920 | consumed samples: 2777088 | elapsed time per iteration (ms): 5978.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018127E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:29.098345 | finish at 2025-09-10 12:37:36 + [2025-09-09 21:20:13] iteration 2713/ 11920 | consumed samples: 2778112 | elapsed time per iteration (ms): 6089.0 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028565E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:34:21.539218 | finish at 2025-09-10 12:54:35 + [2025-09-09 21:20:19] iteration 2714/ 11920 | consumed samples: 2779136 | elapsed time per iteration (ms): 5834.4 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015821E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:11.488029 | finish at 2025-09-10 12:15:31 + [2025-09-09 21:20:25] iteration 2715/ 11920 | consumed samples: 2780160 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011943E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:28.621759 | finish at 2025-09-10 
11:43:54 + [2025-09-09 21:20:31] iteration 2716/ 11920 | consumed samples: 2781184 | elapsed time per iteration (ms): 6002.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022484E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:20:46.944695 | finish at 2025-09-10 12:41:18 + [2025-09-09 21:20:37] iteration 2717/ 11920 | consumed samples: 2782208 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028260E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:22:01.973583 | finish at 2025-09-10 11:42:39 + [2025-09-09 21:20:42] iteration 2718/ 11920 | consumed samples: 2783232 | elapsed time per iteration (ms): 5885.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019578E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:36.246857 | finish at 2025-09-10 12:23:19 + [2025-09-09 21:20:48] iteration 2719/ 11920 | consumed samples: 2784256 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020998E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:28.322611 | finish at 2025-09-10 11:42:16 + [2025-09-09 21:20:54] iteration 2720/ 11920 | consumed samples: 2785280 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039481E+00 | loss 
scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:15.413208 | finish at 2025-09-10 11:44:09 + [2025-09-09 21:21:00] iteration 2721/ 11920 | consumed samples: 2786304 | elapsed time per iteration (ms): 5968.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024448E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:15:05.276732 | finish at 2025-09-10 12:36:05 + [2025-09-09 21:21:05] iteration 2722/ 11920 | consumed samples: 2787328 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024610E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:47.076875 | finish at 2025-09-10 11:42:52 + [2025-09-09 21:21:11] iteration 2723/ 11920 | consumed samples: 2788352 | elapsed time per iteration (ms): 5945.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020098E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:11:24.176642 | finish at 2025-09-10 12:32:35 + [2025-09-09 21:21:17] iteration 2724/ 11920 | consumed samples: 2789376 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023010E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:46.395142 | finish at 2025-09-10 11:42:03 + [2025-09-09 21:21:22] iteration 2725/ 11920 | consumed samples: 2790400 | elapsed time 
per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023262E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:22:08.440815 | finish at 2025-09-10 11:43:31 + [2025-09-09 21:21:28] iteration 2726/ 11920 | consumed samples: 2791424 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018878E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:28.439837 | finish at 2025-09-10 11:42:57 + [2025-09-09 21:21:34] iteration 2727/ 11920 | consumed samples: 2792448 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025761E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:18.833202 | finish at 2025-09-10 11:42:53 + [2025-09-09 21:21:39] iteration 2728/ 11920 | consumed samples: 2793472 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015482E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:18.960079 | finish at 2025-09-10 11:42:58 + [2025-09-09 21:21:45] iteration 2729/ 11920 | consumed samples: 2794496 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028011E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:20:38.682421 | finish at 2025-09-10 11:42:24 + [2025-09-09 21:21:51] iteration 2730/ 11920 | consumed samples: 2795520 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033011E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:22:03.706264 | finish at 2025-09-10 11:43:54 + [2025-09-09 21:21:57] iteration 2731/ 11920 | consumed samples: 2796544 | elapsed time per iteration (ms): 6424.4 | throughput per GPU (TFLOP/s/GPU): 70.3 | MFU 7.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031872E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:23:53.821201 | finish at 2025-09-10 13:45:51 + [2025-09-09 21:22:03] iteration 2732/ 11920 | consumed samples: 2797568 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033174E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:33.446378 | finish at 2025-09-10 11:43:36 + [2025-09-09 21:22:08] iteration 2733/ 11920 | consumed samples: 2798592 | elapsed time per iteration (ms): 5824.6 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034946E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:50.580315 | finish at 2025-09-10 12:13:59 + [2025-09-09 21:22:15] iteration 2734/ 11920 | consumed samples: 2799616 | elapsed time per iteration (ms): 6171.7 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.025024E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:52.872648 | finish at 2025-09-10 13:07:08 + [2025-09-09 21:22:20] iteration 2735/ 11920 | consumed samples: 2800640 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031980E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:00.868592 | finish at 2025-09-10 11:43:21 + [2025-09-09 21:22:26] iteration 2736/ 11920 | consumed samples: 2801664 | elapsed time per iteration (ms): 5877.8 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022213E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:42.021637 | finish at 2025-09-10 12:22:08 + [2025-09-09 21:22:32] iteration 2737/ 11920 | consumed samples: 2802688 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038823E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:56.507473 | finish at 2025-09-10 11:43:28 + [2025-09-09 21:22:37] iteration 2738/ 11920 | consumed samples: 2803712 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027393E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:22:04.159974 | finish at 2025-09-10 11:44:42 + [2025-09-09 
21:22:44] iteration 2739/ 11920 | consumed samples: 2804736 | elapsed time per iteration (ms): 6239.5 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023379E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:44.685751 | finish at 2025-09-10 13:17:28 + [2025-09-09 21:22:49] iteration 2740/ 11920 | consumed samples: 2805760 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018378E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:43.006725 | finish at 2025-09-10 11:43:32 + [2025-09-09 21:22:55] iteration 2741/ 11920 | consumed samples: 2806784 | elapsed time per iteration (ms): 5915.4 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032437E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:04:57.152665 | finish at 2025-09-10 12:27:52 + [2025-09-09 21:23:01] iteration 2742/ 11920 | consumed samples: 2807808 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025561E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:26.401587 | finish at 2025-09-10 11:44:27 + [2025-09-09 21:23:06] iteration 2743/ 11920 | consumed samples: 2808832 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027174E+00 | loss scale: 1.0 | grad norm: 
0.212 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:40.769615 | finish at 2025-09-10 11:43:47 + [2025-09-09 21:23:12] iteration 2744/ 11920 | consumed samples: 2809856 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034023E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:20.534954 | finish at 2025-09-10 11:43:33 + [2025-09-09 21:23:18] iteration 2745/ 11920 | consumed samples: 2810880 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038955E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:26.738623 | finish at 2025-09-10 11:42:44 + [2025-09-09 21:23:23] iteration 2746/ 11920 | consumed samples: 2811904 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029145E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:45.979275 | finish at 2025-09-10 11:44:09 + [2025-09-09 21:23:29] iteration 2747/ 11920 | consumed samples: 2812928 | elapsed time per iteration (ms): 5949.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016055E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:37.148115 | finish at 2025-09-10 12:33:06 + [2025-09-09 21:23:35] iteration 2748/ 11920 | consumed samples: 2813952 | elapsed time per iteration (ms): 
5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016826E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:21.538188 | finish at 2025-09-10 11:43:56 + [2025-09-09 21:23:41] iteration 2749/ 11920 | consumed samples: 2814976 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020288E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:28.184484 | finish at 2025-09-10 11:43:09 + [2025-09-09 21:23:46] iteration 2750/ 11920 | consumed samples: 2816000 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013802E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:44.971080 | finish at 2025-09-10 11:43:31 + [2025-09-09 21:23:52] iteration 2751/ 11920 | consumed samples: 2817024 | elapsed time per iteration (ms): 6003.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038799E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:17:26.441130 | finish at 2025-09-10 12:41:19 + [2025-09-09 21:23:58] iteration 2752/ 11920 | consumed samples: 2818048 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021558E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 14:20:24.059738 | finish at 2025-09-10 11:44:22 + [2025-09-09 21:24:03] iteration 2753/ 11920 | consumed samples: 2819072 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015718E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:26.970659 | finish at 2025-09-10 11:42:30 + [2025-09-09 21:24:09] iteration 2754/ 11920 | consumed samples: 2820096 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024412E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:30.552223 | finish at 2025-09-10 11:42:40 + [2025-09-09 21:24:15] iteration 2755/ 11920 | consumed samples: 2821120 | elapsed time per iteration (ms): 5848.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009590E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:18.170335 | finish at 2025-09-10 12:17:33 + [2025-09-09 21:24:21] iteration 2756/ 11920 | consumed samples: 2822144 | elapsed time per iteration (ms): 5964.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028616E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:55.817554 | finish at 2025-09-10 12:35:17 + [2025-09-09 21:24:26] iteration 2757/ 11920 | consumed samples: 2823168 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.006406E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:06.751093 | finish at 2025-09-10 11:43:33 + [2025-09-09 21:24:32] iteration 2758/ 11920 | consumed samples: 2824192 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022093E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:39.355920 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:24:38] iteration 2759/ 11920 | consumed samples: 2825216 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012919E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:21.455642 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:24:43] iteration 2760/ 11920 | consumed samples: 2826240 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020885E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:07.181158 | finish at 2025-09-10 11:42:50 + [2025-09-09 21:24:49] iteration 2761/ 11920 | consumed samples: 2827264 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020825E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:21.605571 | finish at 2025-09-10 11:42:11 + [2025-09-09 21:24:55] iteration 
2762/ 11920 | consumed samples: 2828288 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014079E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:07.099822 | finish at 2025-09-10 11:44:02 + [2025-09-09 21:25:00] iteration 2763/ 11920 | consumed samples: 2829312 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011788E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:57.456278 | finish at 2025-09-10 11:43:58 + [2025-09-09 21:25:06] iteration 2764/ 11920 | consumed samples: 2830336 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026402E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:08.302025 | finish at 2025-09-10 11:43:14 + [2025-09-09 21:25:11] iteration 2765/ 11920 | consumed samples: 2831360 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013300E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:58.727849 | finish at 2025-09-10 11:43:10 + [2025-09-09 21:25:17] iteration 2766/ 11920 | consumed samples: 2832384 | elapsed time per iteration (ms): 5833.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011483E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:59.799235 | finish at 2025-09-10 12:15:17 + [2025-09-09 21:25:23] iteration 2767/ 11920 | consumed samples: 2833408 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008281E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:30.648806 | finish at 2025-09-10 11:43:54 + [2025-09-09 21:25:29] iteration 2768/ 11920 | consumed samples: 2834432 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006071E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:25.816681 | finish at 2025-09-10 11:42:54 + [2025-09-09 21:25:34] iteration 2769/ 11920 | consumed samples: 2835456 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017568E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:15.552613 | finish at 2025-09-10 11:42:50 + [2025-09-09 21:25:40] iteration 2770/ 11920 | consumed samples: 2836480 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021821E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:55.151410 | finish at 2025-09-10 11:44:35 + [2025-09-09 21:25:45] iteration 2771/ 11920 | consumed samples: 2837504 | elapsed time per iteration (ms): 5623.1 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015423E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:25.740106 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:25:51] iteration 2772/ 11920 | consumed samples: 2838528 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020462E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:25.115980 | finish at 2025-09-10 11:43:16 + [2025-09-09 21:25:57] iteration 2773/ 11920 | consumed samples: 2839552 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022509E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:39.939653 | finish at 2025-09-10 11:43:37 + [2025-09-09 21:26:02] iteration 2774/ 11920 | consumed samples: 2840576 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017576E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:22.314798 | finish at 2025-09-10 11:44:25 + [2025-09-09 21:26:08] iteration 2775/ 11920 | consumed samples: 2841600 | elapsed time per iteration (ms): 5948.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009512E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:39.773065 
| finish at 2025-09-10 12:32:48 + [2025-09-09 21:26:14] iteration 2776/ 11920 | consumed samples: 2842624 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022170E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:40.205612 | finish at 2025-09-10 11:42:54 + [2025-09-09 21:26:19] iteration 2777/ 11920 | consumed samples: 2843648 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016192E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:40.814460 | finish at 2025-09-10 11:43:00 + [2025-09-09 21:26:25] iteration 2778/ 11920 | consumed samples: 2844672 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006012E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:46.346365 | finish at 2025-09-10 11:44:11 + [2025-09-09 21:26:31] iteration 2779/ 11920 | consumed samples: 2845696 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010670E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:18.019971 | finish at 2025-09-10 11:42:49 + [2025-09-09 21:26:37] iteration 2780/ 11920 | consumed samples: 2846720 | elapsed time per iteration (ms): 6165.8 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.997655E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:39:15.774603 | finish at 2025-09-10 13:05:53 + [2025-09-09 21:26:43] iteration 2781/ 11920 | consumed samples: 2847744 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014298E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:25.528817 | finish at 2025-09-10 11:44:08 + [2025-09-09 21:26:48] iteration 2782/ 11920 | consumed samples: 2848768 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012916E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:34.406801 | finish at 2025-09-10 11:43:23 + [2025-09-09 21:26:54] iteration 2783/ 11920 | consumed samples: 2849792 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012512E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:56.849449 | finish at 2025-09-10 11:43:51 + [2025-09-09 21:27:00] iteration 2784/ 11920 | consumed samples: 2850816 | elapsed time per iteration (ms): 5947.2 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012468E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:05:34.064098 | finish at 2025-09-10 12:32:34 + [2025-09-09 21:27:05] iteration 2785/ 11920 | consumed samples: 
2851840 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010547E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:30.762938 | finish at 2025-09-10 11:43:36 + [2025-09-09 21:27:11] iteration 2786/ 11920 | consumed samples: 2852864 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017224E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:38.577686 | finish at 2025-09-10 11:42:50 + [2025-09-09 21:27:17] iteration 2787/ 11920 | consumed samples: 2853888 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996867E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:25.740926 | finish at 2025-09-10 11:42:42 + [2025-09-09 21:27:22] iteration 2788/ 11920 | consumed samples: 2854912 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014053E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:37.099557 | finish at 2025-09-10 11:43:59 + [2025-09-09 21:27:28] iteration 2789/ 11920 | consumed samples: 2855936 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026023E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:25.231735 | finish at 2025-09-10 11:42:53 + [2025-09-09 21:27:33] iteration 2790/ 11920 | consumed samples: 2856960 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021126E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:10.161426 | finish at 2025-09-10 11:42:44 + [2025-09-09 21:27:39] iteration 2791/ 11920 | consumed samples: 2857984 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993280E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:04.145562 | finish at 2025-09-10 11:43:43 + [2025-09-09 21:27:45] iteration 2792/ 11920 | consumed samples: 2859008 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022356E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:15.324181 | finish at 2025-09-10 11:43:00 + [2025-09-09 21:27:50] iteration 2793/ 11920 | consumed samples: 2860032 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007699E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:49.502320 | finish at 2025-09-10 11:43:40 + [2025-09-09 21:27:56] iteration 2794/ 11920 | consumed samples: 2861056 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007365E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:03.963269 | finish at 2025-09-10 11:44:00 + [2025-09-09 21:28:02] iteration 2795/ 11920 | consumed samples: 2862080 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024439E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:16.983973 | finish at 2025-09-10 11:45:19 + [2025-09-09 21:28:07] iteration 2796/ 11920 | consumed samples: 2863104 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027662E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:53.176519 | finish at 2025-09-10 11:44:00 + [2025-09-09 21:28:13] iteration 2797/ 11920 | consumed samples: 2864128 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016060E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:43.452458 | finish at 2025-09-10 11:43:56 + [2025-09-09 21:28:18] iteration 2798/ 11920 | consumed samples: 2865152 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025496E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:46.652398 | finish at 2025-09-10 
11:43:05 + [2025-09-09 21:28:24] iteration 2799/ 11920 | consumed samples: 2866176 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004413E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:41.208239 | finish at 2025-09-10 11:44:05 + [2025-09-09 21:28:30] iteration 2800/ 11920 | consumed samples: 2867200 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004677E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:24.505463 | finish at 2025-09-10 11:42:54 + [2025-09-09 21:28:35] iteration 2801/ 11920 | consumed samples: 2868224 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012772E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:59.619309 | finish at 2025-09-10 11:42:35 + [2025-09-09 21:28:41] iteration 2802/ 11920 | consumed samples: 2869248 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994277E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:35.780852 | finish at 2025-09-10 11:42:17 + [2025-09-09 21:28:47] iteration 2803/ 11920 | consumed samples: 2870272 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002074E+00 | loss 
scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:55.412712 | finish at 2025-09-10 11:43:42 + [2025-09-09 21:28:52] iteration 2804/ 11920 | consumed samples: 2871296 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013283E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:19.770919 | finish at 2025-09-10 11:44:12 + [2025-09-09 21:28:58] iteration 2805/ 11920 | consumed samples: 2872320 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016118E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:49.921131 | finish at 2025-09-10 11:43:48 + [2025-09-09 21:29:03] iteration 2806/ 11920 | consumed samples: 2873344 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021436E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:33.787364 | finish at 2025-09-10 11:44:37 + [2025-09-09 21:29:09] iteration 2807/ 11920 | consumed samples: 2874368 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021076E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:00.031414 | finish at 2025-09-10 11:44:09 + [2025-09-09 21:29:15] iteration 2808/ 11920 | consumed samples: 2875392 | elapsed time 
per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008620E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:41.337576 | finish at 2025-09-10 11:42:56 + [2025-09-09 21:29:20] iteration 2809/ 11920 | consumed samples: 2876416 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018014E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:53.577742 | finish at 2025-09-10 11:44:14 + [2025-09-09 21:29:26] iteration 2810/ 11920 | consumed samples: 2877440 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009553E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:42.474468 | finish at 2025-09-10 11:44:08 + [2025-09-09 21:29:32] iteration 2811/ 11920 | consumed samples: 2878464 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017230E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:19.282236 | finish at 2025-09-10 11:43:51 + [2025-09-09 21:29:37] iteration 2812/ 11920 | consumed samples: 2879488 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007666E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:13:34.808656 | finish at 2025-09-10 11:43:12 + [2025-09-09 21:29:43] iteration 2813/ 11920 | consumed samples: 2880512 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004969E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:10.310678 | finish at 2025-09-10 11:42:53 + [2025-09-09 21:29:48] iteration 2814/ 11920 | consumed samples: 2881536 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023274E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:36.428121 | finish at 2025-09-10 11:43:25 + [2025-09-09 21:29:54] iteration 2815/ 11920 | consumed samples: 2882560 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031531E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:22.049754 | finish at 2025-09-10 11:44:16 + [2025-09-09 21:30:00] iteration 2816/ 11920 | consumed samples: 2883584 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009356E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:01.675022 | finish at 2025-09-10 11:44:01 + [2025-09-09 21:30:05] iteration 2817/ 11920 | consumed samples: 2884608 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.020826E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:16.906914 | finish at 2025-09-10 11:43:22 + [2025-09-09 21:30:11] iteration 2818/ 11920 | consumed samples: 2885632 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021547E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:36.157724 | finish at 2025-09-10 11:42:47 + [2025-09-09 21:30:17] iteration 2819/ 11920 | consumed samples: 2886656 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013937E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:51.930552 | finish at 2025-09-10 11:44:09 + [2025-09-09 21:30:22] iteration 2820/ 11920 | consumed samples: 2887680 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013102E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:36.525030 | finish at 2025-09-10 11:43:59 + [2025-09-09 21:30:28] iteration 2821/ 11920 | consumed samples: 2888704 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026483E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:45.437679 | finish at 2025-09-10 11:43:13 + [2025-09-09 
21:30:33] iteration 2822/ 11920 | consumed samples: 2889728 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009743E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:31.493694 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:30:39] iteration 2823/ 11920 | consumed samples: 2890752 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023210E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:52.073543 | finish at 2025-09-10 11:42:31 + [2025-09-09 21:30:45] iteration 2824/ 11920 | consumed samples: 2891776 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016979E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:41.950871 | finish at 2025-09-10 11:43:27 + [2025-09-09 21:30:50] iteration 2825/ 11920 | consumed samples: 2892800 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996988E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:52.695585 | finish at 2025-09-10 11:43:43 + [2025-09-09 21:30:56] iteration 2826/ 11920 | consumed samples: 2893824 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999668E+00 | loss scale: 1.0 | grad norm: 
0.178 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:15.151364 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:31:02] iteration 2827/ 11920 | consumed samples: 2894848 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004650E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:09.437267 | finish at 2025-09-10 11:44:11 + [2025-09-09 21:31:08] iteration 2828/ 11920 | consumed samples: 2895872 | elapsed time per iteration (ms): 5991.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018090E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:07:51.441701 | finish at 2025-09-10 12:38:59 + [2025-09-09 21:31:13] iteration 2829/ 11920 | consumed samples: 2896896 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022377E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:27.586884 | finish at 2025-09-10 11:42:41 + [2025-09-09 21:31:19] iteration 2830/ 11920 | consumed samples: 2897920 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011448E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:46.411436 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:31:25] iteration 2831/ 11920 | consumed samples: 2898944 | elapsed time per iteration (ms): 
5874.1 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001621E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:49.542937 | finish at 2025-09-10 12:21:14 + [2025-09-09 21:31:30] iteration 2832/ 11920 | consumed samples: 2899968 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024041E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:09.683777 | finish at 2025-09-10 11:42:40 + [2025-09-09 21:31:36] iteration 2833/ 11920 | consumed samples: 2900992 | elapsed time per iteration (ms): 5944.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015021E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:17.677722 | finish at 2025-09-10 12:31:54 + [2025-09-09 21:31:42] iteration 2834/ 11920 | consumed samples: 2902016 | elapsed time per iteration (ms): 5953.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018724E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:31.402178 | finish at 2025-09-10 12:33:14 + [2025-09-09 21:31:48] iteration 2835/ 11920 | consumed samples: 2903040 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026360E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 14:10:46.972766 | finish at 2025-09-10 11:42:35 + [2025-09-09 21:31:53] iteration 2836/ 11920 | consumed samples: 2904064 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011018E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:36.953053 | finish at 2025-09-10 11:42:30 + [2025-09-09 21:31:59] iteration 2837/ 11920 | consumed samples: 2905088 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022452E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:36.276469 | finish at 2025-09-10 11:44:35 + [2025-09-09 21:32:05] iteration 2838/ 11920 | consumed samples: 2906112 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016397E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:08.335112 | finish at 2025-09-10 11:44:13 + [2025-09-09 21:32:11] iteration 2839/ 11920 | consumed samples: 2907136 | elapsed time per iteration (ms): 5963.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023356E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:34.115022 | finish at 2025-09-10 12:34:45 + [2025-09-09 21:32:16] iteration 2840/ 11920 | consumed samples: 2908160 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.999312E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:08.316412 | finish at 2025-09-10 11:42:25 + [2025-09-09 21:32:22] iteration 2841/ 11920 | consumed samples: 2909184 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021327E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:06.686564 | finish at 2025-09-10 11:43:29 + [2025-09-09 21:32:28] iteration 2842/ 11920 | consumed samples: 2910208 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020115E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:24.261181 | finish at 2025-09-10 11:42:52 + [2025-09-09 21:32:33] iteration 2843/ 11920 | consumed samples: 2911232 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017457E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:17.664511 | finish at 2025-09-10 11:42:51 + [2025-09-09 21:32:39] iteration 2844/ 11920 | consumed samples: 2912256 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014824E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:22.151225 | finish at 2025-09-10 11:42:01 + [2025-09-09 21:32:45] iteration 
2845/ 11920 | consumed samples: 2913280 | elapsed time per iteration (ms): 5908.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999574E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:53:41.898115 | finish at 2025-09-10 12:26:27 + [2025-09-09 21:32:50] iteration 2846/ 11920 | consumed samples: 2914304 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014924E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:39.441389 | finish at 2025-09-10 11:43:30 + [2025-09-09 21:32:56] iteration 2847/ 11920 | consumed samples: 2915328 | elapsed time per iteration (ms): 5883.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001712E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:49:37.931185 | finish at 2025-09-10 12:22:34 + [2025-09-09 21:33:02] iteration 2848/ 11920 | consumed samples: 2916352 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016577E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:21.886837 | finish at 2025-09-10 11:43:24 + [2025-09-09 21:33:07] iteration 2849/ 11920 | consumed samples: 2917376 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004109E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:36.668113 | finish at 2025-09-10 11:42:44 + [2025-09-09 21:33:14] iteration 2850/ 11920 | consumed samples: 2918400 | elapsed time per iteration (ms): 6311.8 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014584E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:54:07.716796 | finish at 2025-09-10 13:27:21 + [2025-09-09 21:33:19] iteration 2851/ 11920 | consumed samples: 2919424 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999240E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:56.022465 | finish at 2025-09-10 11:42:15 + [2025-09-09 21:33:25] iteration 2852/ 11920 | consumed samples: 2920448 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000443E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:34.503866 | finish at 2025-09-10 11:43:00 + [2025-09-09 21:33:31] iteration 2853/ 11920 | consumed samples: 2921472 | elapsed time per iteration (ms): 5931.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012100E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:19.522282 | finish at 2025-09-10 12:29:50 + [2025-09-09 21:33:37] iteration 2854/ 11920 | consumed samples: 2922496 | elapsed time per iteration (ms): 5619.1 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998997E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:02.398318 | finish at 2025-09-10 11:42:39 + [2025-09-09 21:33:42] iteration 2855/ 11920 | consumed samples: 2923520 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003137E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:52.644759 | finish at 2025-09-10 11:42:35 + [2025-09-09 21:33:48] iteration 2856/ 11920 | consumed samples: 2924544 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991563E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:57.474716 | finish at 2025-09-10 11:42:45 + [2025-09-09 21:33:53] iteration 2857/ 11920 | consumed samples: 2925568 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997384E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:01.407801 | finish at 2025-09-10 11:42:55 + [2025-09-09 21:33:59] iteration 2858/ 11920 | consumed samples: 2926592 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006289E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:32.075574 
| finish at 2025-09-10 11:43:31 + [2025-09-09 21:34:05] iteration 2859/ 11920 | consumed samples: 2927616 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001055E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:52.905925 | finish at 2025-09-10 11:43:58 + [2025-09-09 21:34:10] iteration 2860/ 11920 | consumed samples: 2928640 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002659E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:56.747618 | finish at 2025-09-10 11:43:07 + [2025-09-09 21:34:16] iteration 2861/ 11920 | consumed samples: 2929664 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995589E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:36.743126 | finish at 2025-09-10 11:42:53 + [2025-09-09 21:34:22] iteration 2862/ 11920 | consumed samples: 2930688 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999067E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:37.614300 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:34:27] iteration 2863/ 11920 | consumed samples: 2931712 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.016985E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:34.430823 | finish at 2025-09-10 11:44:02 + [2025-09-09 21:34:33] iteration 2864/ 11920 | consumed samples: 2932736 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010593E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:39.915878 | finish at 2025-09-10 11:43:13 + [2025-09-09 21:34:38] iteration 2865/ 11920 | consumed samples: 2933760 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008096E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:54.312793 | finish at 2025-09-10 11:42:33 + [2025-09-09 21:34:44] iteration 2866/ 11920 | consumed samples: 2934784 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001356E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:13.577090 | finish at 2025-09-10 11:42:58 + [2025-09-09 21:34:50] iteration 2867/ 11920 | consumed samples: 2935808 | elapsed time per iteration (ms): 5975.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988718E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:39.731457 | finish at 2025-09-10 12:36:30 + [2025-09-09 21:34:56] iteration 2868/ 11920 | consumed samples: 
2936832 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016601E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:17.215407 | finish at 2025-09-10 11:43:13 + [2025-09-09 21:35:01] iteration 2869/ 11920 | consumed samples: 2937856 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025226E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:38.878620 | finish at 2025-09-10 11:44:40 + [2025-09-09 21:35:07] iteration 2870/ 11920 | consumed samples: 2938880 | elapsed time per iteration (ms): 5838.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007415E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:34.981537 | finish at 2025-09-10 12:15:42 + [2025-09-09 21:35:13] iteration 2871/ 11920 | consumed samples: 2939904 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012528E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:59.419427 | finish at 2025-09-10 11:43:12 + [2025-09-09 21:35:19] iteration 2872/ 11920 | consumed samples: 2940928 | elapsed time per iteration (ms): 5839.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014181E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:40:37.913933 | finish at 2025-09-10 12:15:56 + [2025-09-09 21:35:25] iteration 2873/ 11920 | consumed samples: 2941952 | elapsed time per iteration (ms): 5957.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029351E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:15.806947 | finish at 2025-09-10 12:33:40 + [2025-09-09 21:35:30] iteration 2874/ 11920 | consumed samples: 2942976 | elapsed time per iteration (ms): 5967.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025081E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:42.897629 | finish at 2025-09-10 12:35:13 + [2025-09-09 21:35:36] iteration 2875/ 11920 | consumed samples: 2944000 | elapsed time per iteration (ms): 5873.1 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019474E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:21.790931 | finish at 2025-09-10 12:20:58 + [2025-09-09 21:35:42] iteration 2876/ 11920 | consumed samples: 2945024 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020274E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:15.983777 | finish at 2025-09-10 11:42:58 + [2025-09-09 21:35:48] iteration 2877/ 11920 | consumed samples: 2946048 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010436E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:11.214442 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:35:53] iteration 2878/ 11920 | consumed samples: 2947072 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004494E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:15.272841 | finish at 2025-09-10 11:43:08 + [2025-09-09 21:35:59] iteration 2879/ 11920 | consumed samples: 2948096 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010995E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:18.916435 | finish at 2025-09-10 11:45:18 + [2025-09-09 21:36:04] iteration 2880/ 11920 | consumed samples: 2949120 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007195E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:09.754581 | finish at 2025-09-10 11:44:14 + [2025-09-09 21:36:10] iteration 2881/ 11920 | consumed samples: 2950144 | elapsed time per iteration (ms): 5921.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020164E+00 | loss scale: 1.0 | grad norm: 0.402 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:02.630107 | finish at 2025-09-10 
12:28:13 + [2025-09-09 21:36:16] iteration 2882/ 11920 | consumed samples: 2951168 | elapsed time per iteration (ms): 5876.8 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057634E+00 | loss scale: 1.0 | grad norm: 0.440 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:45:14.967149 | finish at 2025-09-10 12:21:31 + [2025-09-09 21:36:22] iteration 2883/ 11920 | consumed samples: 2952192 | elapsed time per iteration (ms): 5650.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.118755E+00 | loss scale: 1.0 | grad norm: 0.934 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:58.637181 | finish at 2025-09-10 11:47:21 + [2025-09-09 21:36:28] iteration 2884/ 11920 | consumed samples: 2953216 | elapsed time per iteration (ms): 5688.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.240252E+00 | loss scale: 1.0 | grad norm: 1.367 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:42.593036 | finish at 2025-09-10 11:53:10 + [2025-09-09 21:36:33] iteration 2885/ 11920 | consumed samples: 2954240 | elapsed time per iteration (ms): 5731.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.803979E+00 | loss scale: 1.0 | grad norm: 4.002 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:04.155543 | finish at 2025-09-10 11:59:38 + [2025-09-09 21:36:39] iteration 2886/ 11920 | consumed samples: 2955264 | elapsed time per iteration (ms): 5717.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418847E+00 | loss 
scale: 1.0 | grad norm: 1.384 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:48.444236 | finish at 2025-09-10 11:57:28 + [2025-09-09 21:36:45] iteration 2887/ 11920 | consumed samples: 2956288 | elapsed time per iteration (ms): 5679.2 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.450167E+00 | loss scale: 1.0 | grad norm: 1.496 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:00.372373 | finish at 2025-09-10 11:51:45 + [2025-09-09 21:36:50] iteration 2888/ 11920 | consumed samples: 2957312 | elapsed time per iteration (ms): 5695.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.413843E+00 | loss scale: 1.0 | grad norm: 1.396 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:20.469492 | finish at 2025-09-10 11:54:11 + [2025-09-09 21:36:56] iteration 2889/ 11920 | consumed samples: 2958336 | elapsed time per iteration (ms): 5695.1 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.405519E+00 | loss scale: 1.0 | grad norm: 0.885 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:12.011632 | finish at 2025-09-10 11:54:08 + [2025-09-09 21:37:03] iteration 2890/ 11920 | consumed samples: 2959360 | elapsed time per iteration (ms): 6518.2 | throughput per GPU (TFLOP/s/GPU): 69.3 | MFU 7.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.452711E+00 | loss scale: 1.0 | grad norm: 1.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 16:20:59.392648 | finish at 2025-09-10 13:58:02 + [2025-09-09 21:37:08] iteration 2891/ 11920 | consumed samples: 2960384 | elapsed time 
per iteration (ms): 5698.2 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.552234E+00 | loss scale: 1.0 | grad norm: 1.130 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:28.916373 | finish at 2025-09-10 11:54:37 + [2025-09-09 21:37:14] iteration 2892/ 11920 | consumed samples: 2961408 | elapsed time per iteration (ms): 5705.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.492540E+00 | loss scale: 1.0 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:29.052807 | finish at 2025-09-10 11:55:43 + [2025-09-09 21:37:20] iteration 2893/ 11920 | consumed samples: 2962432 | elapsed time per iteration (ms): 5977.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.483814E+00 | loss scale: 1.0 | grad norm: 1.013 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:17.497527 | finish at 2025-09-10 12:36:38 + [2025-09-09 21:37:26] iteration 2894/ 11920 | consumed samples: 2963456 | elapsed time per iteration (ms): 5707.6 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.491322E+00 | loss scale: 1.0 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:36.359653 | finish at 2025-09-10 11:56:02 + [2025-09-09 21:37:32] iteration 2895/ 11920 | consumed samples: 2964480 | elapsed time per iteration (ms): 6103.4 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.492406E+00 | loss scale: 1.0 | grad norm: 1.009 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 15:18:02.801920 | finish at 2025-09-10 12:55:35 + [2025-09-09 21:37:38] iteration 2896/ 11920 | consumed samples: 2965504 | elapsed time per iteration (ms): 5689.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.499658E+00 | loss scale: 1.0 | grad norm: 0.764 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:40.288925 | finish at 2025-09-10 11:53:18 + [2025-09-09 21:37:44] iteration 2897/ 11920 | consumed samples: 2966528 | elapsed time per iteration (ms): 5951.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.604207E+00 | loss scale: 1.0 | grad norm: 1.518 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:55:02.997719 | finish at 2025-09-10 12:32:46 + [2025-09-09 21:37:49] iteration 2898/ 11920 | consumed samples: 2967552 | elapsed time per iteration (ms): 5717.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.593436E+00 | loss scale: 1.0 | grad norm: 1.067 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:39.561339 | finish at 2025-09-10 11:57:29 + [2025-09-09 21:37:55] iteration 2899/ 11920 | consumed samples: 2968576 | elapsed time per iteration (ms): 5697.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.579033E+00 | loss scale: 1.0 | grad norm: 1.064 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:36.523690 | finish at 2025-09-10 11:54:31 + [2025-09-09 21:38:01] iteration 2900/ 11920 | consumed samples: 2969600 | elapsed time per iteration (ms): 5984.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.534666E+00 | loss scale: 1.0 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:43.152819 | finish at 2025-09-10 12:37:44 + [2025-09-09 21:38:07] iteration 2901/ 11920 | consumed samples: 2970624 | elapsed time per iteration (ms): 5714.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.619307E+00 | loss scale: 1.0 | grad norm: 1.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:59.397597 | finish at 2025-09-10 11:57:06 + [2025-09-09 21:38:12] iteration 2902/ 11920 | consumed samples: 2971648 | elapsed time per iteration (ms): 5663.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.505753E+00 | loss scale: 1.0 | grad norm: 0.521 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:14.237008 | finish at 2025-09-10 11:49:27 + [2025-09-09 21:38:18] iteration 2903/ 11920 | consumed samples: 2972672 | elapsed time per iteration (ms): 5686.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.582203E+00 | loss scale: 1.0 | grad norm: 1.563 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:38.077711 | finish at 2025-09-10 11:52:56 + [2025-09-09 21:38:24] iteration 2904/ 11920 | consumed samples: 2973696 | elapsed time per iteration (ms): 5727.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.010918E+00 | loss scale: 1.0 | grad norm: 5.415 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:38.094959 | finish at 2025-09-10 11:59:02 + [2025-09-09 
21:38:29] iteration 2905/ 11920 | consumed samples: 2974720 | elapsed time per iteration (ms): 5699.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.753032E+00 | loss scale: 1.0 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:21.276970 | finish at 2025-09-10 11:54:51 + [2025-09-09 21:38:35] iteration 2906/ 11920 | consumed samples: 2975744 | elapsed time per iteration (ms): 5685.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.667167E+00 | loss scale: 1.0 | grad norm: 0.951 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:14:10.566145 | finish at 2025-09-10 11:52:46 + [2025-09-09 21:38:41] iteration 2907/ 11920 | consumed samples: 2976768 | elapsed time per iteration (ms): 5759.4 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.339458E+00 | loss scale: 1.0 | grad norm: 5.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:25:09.651721 | finish at 2025-09-10 12:03:50 + [2025-09-09 21:38:47] iteration 2908/ 11920 | consumed samples: 2977792 | elapsed time per iteration (ms): 5722.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.226355E+00 | loss scale: 1.0 | grad norm: 3.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:29.153212 | finish at 2025-09-10 11:58:16 + [2025-09-09 21:38:53] iteration 2909/ 11920 | consumed samples: 2978816 | elapsed time per iteration (ms): 5940.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.079686E+00 | loss scale: 1.0 | grad norm: 
1.971 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:12.812615 | finish at 2025-09-10 12:31:05 + [2025-09-09 21:38:58] iteration 2910/ 11920 | consumed samples: 2979840 | elapsed time per iteration (ms): 5797.2 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.745682E+00 | loss scale: 1.0 | grad norm: 4.590 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:30:32.724421 | finish at 2025-09-10 12:09:31 + [2025-09-09 21:39:04] iteration 2911/ 11920 | consumed samples: 2980864 | elapsed time per iteration (ms): 5716.3 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.186079E+00 | loss scale: 1.0 | grad norm: 1.461 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:17.919117 | finish at 2025-09-10 11:57:22 + [2025-09-09 21:39:10] iteration 2912/ 11920 | consumed samples: 2981888 | elapsed time per iteration (ms): 5713.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.552499E+00 | loss scale: 1.0 | grad norm: 2.989 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:47.584049 | finish at 2025-09-10 11:56:57 + [2025-09-09 21:39:15] iteration 2913/ 11920 | consumed samples: 2982912 | elapsed time per iteration (ms): 5704.7 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.154271E+00 | loss scale: 1.0 | grad norm: 1.080 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:22.376716 | finish at 2025-09-10 11:55:38 + [2025-09-09 21:39:21] iteration 2914/ 11920 | consumed samples: 2983936 | elapsed time per iteration (ms): 
5698.8 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.120574E+00 | loss scale: 1.0 | grad norm: 1.445 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:23.425791 | finish at 2025-09-10 11:54:45 + [2025-09-09 21:39:27] iteration 2915/ 11920 | consumed samples: 2984960 | elapsed time per iteration (ms): 5705.6 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.024645E+00 | loss scale: 1.0 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:19.211608 | finish at 2025-09-10 11:55:46 + [2025-09-09 21:39:33] iteration 2916/ 11920 | consumed samples: 2985984 | elapsed time per iteration (ms): 5710.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.005136E+00 | loss scale: 1.0 | grad norm: 0.990 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:57.644705 | finish at 2025-09-10 11:56:30 + [2025-09-09 21:39:38] iteration 2917/ 11920 | consumed samples: 2987008 | elapsed time per iteration (ms): 5709.1 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.991200E+00 | loss scale: 1.0 | grad norm: 1.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:38.812724 | finish at 2025-09-10 11:56:17 + [2025-09-09 21:39:44] iteration 2918/ 11920 | consumed samples: 2988032 | elapsed time per iteration (ms): 5720.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.979956E+00 | loss scale: 1.0 | grad norm: 1.149 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 14:18:16.181311 | finish at 2025-09-10 11:58:00 + [2025-09-09 21:39:50] iteration 2919/ 11920 | consumed samples: 2989056 | elapsed time per iteration (ms): 5690.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.956113E+00 | loss scale: 1.0 | grad norm: 1.240 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:37.501744 | finish at 2025-09-10 11:53:27 + [2025-09-09 21:39:55] iteration 2920/ 11920 | consumed samples: 2990080 | elapsed time per iteration (ms): 5710.2 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.993366E+00 | loss scale: 1.0 | grad norm: 1.429 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:31.933680 | finish at 2025-09-10 11:56:27 + [2025-09-09 21:40:01] iteration 2921/ 11920 | consumed samples: 2991104 | elapsed time per iteration (ms): 5721.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.895228E+00 | loss scale: 1.0 | grad norm: 0.791 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:07.872182 | finish at 2025-09-10 11:58:09 + [2025-09-09 21:40:07] iteration 2922/ 11920 | consumed samples: 2992128 | elapsed time per iteration (ms): 5711.4 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.854955E+00 | loss scale: 1.0 | grad norm: 0.756 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:31.278317 | finish at 2025-09-10 11:56:38 + [2025-09-09 21:40:12] iteration 2923/ 11920 | consumed samples: 2993152 | elapsed time per iteration (ms): 5676.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.845346E+00 | loss scale: 1.0 | grad norm: 0.941 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:14.946222 | finish at 2025-09-10 11:51:27 + [2025-09-09 21:40:18] iteration 2924/ 11920 | consumed samples: 2994176 | elapsed time per iteration (ms): 5664.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.806062E+00 | loss scale: 1.0 | grad norm: 0.623 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:13.378626 | finish at 2025-09-10 11:49:32 + [2025-09-09 21:40:24] iteration 2925/ 11920 | consumed samples: 2995200 | elapsed time per iteration (ms): 5665.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.758940E+00 | loss scale: 1.0 | grad norm: 0.527 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:18.405329 | finish at 2025-09-10 11:49:42 + [2025-09-09 21:40:29] iteration 2926/ 11920 | consumed samples: 2996224 | elapsed time per iteration (ms): 5661.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.728205E+00 | loss scale: 1.0 | grad norm: 0.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:35.896143 | finish at 2025-09-10 11:49:05 + [2025-09-09 21:40:35] iteration 2927/ 11920 | consumed samples: 2997248 | elapsed time per iteration (ms): 5686.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.746509E+00 | loss scale: 1.0 | grad norm: 0.777 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:14.078908 | finish at 2025-09-10 11:52:49 + [2025-09-09 21:40:41] iteration 
2928/ 11920 | consumed samples: 2998272 | elapsed time per iteration (ms): 5680.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.743630E+00 | loss scale: 1.0 | grad norm: 0.904 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:22.979538 | finish at 2025-09-10 11:52:04 + [2025-09-09 21:40:47] iteration 2929/ 11920 | consumed samples: 2999296 | elapsed time per iteration (ms): 5671.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.681073E+00 | loss scale: 1.0 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:56.361887 | finish at 2025-09-10 11:50:43 + [2025-09-09 21:40:52] iteration 2930/ 11920 | consumed samples: 3000320 | elapsed time per iteration (ms): 5673.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.671804E+00 | loss scale: 1.0 | grad norm: 0.769 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:00.796003 | finish at 2025-09-10 11:50:53 + [2025-09-09 21:40:58] iteration 2931/ 11920 | consumed samples: 3001344 | elapsed time per iteration (ms): 5690.9 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.694152E+00 | loss scale: 1.0 | grad norm: 0.994 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:12:35.190129 | finish at 2025-09-10 11:53:33 + [2025-09-09 21:41:04] iteration 2932/ 11920 | consumed samples: 3002368 | elapsed time per iteration (ms): 5698.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.741828E+00 | loss scale: 1.0 | grad norm: 1.444 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:39.257289 | finish at 2025-09-10 11:54:43 + [2025-09-09 21:41:09] iteration 2933/ 11920 | consumed samples: 3003392 | elapsed time per iteration (ms): 5675.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.639037E+00 | loss scale: 1.0 | grad norm: 0.513 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:03.534367 | finish at 2025-09-10 11:51:13 + [2025-09-09 21:41:15] iteration 2934/ 11920 | consumed samples: 3004416 | elapsed time per iteration (ms): 6079.8 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.669375E+00 | loss scale: 1.0 | grad norm: 1.562 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:10:33.523157 | finish at 2025-09-10 12:51:49 + [2025-09-09 21:41:21] iteration 2935/ 11920 | consumed samples: 3005440 | elapsed time per iteration (ms): 5886.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.675472E+00 | loss scale: 1.0 | grad norm: 1.371 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:29.025375 | finish at 2025-09-10 12:22:50 + [2025-09-09 21:41:27] iteration 2936/ 11920 | consumed samples: 3006464 | elapsed time per iteration (ms): 5680.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.635517E+00 | loss scale: 1.0 | grad norm: 0.666 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:31.729498 | finish at 2025-09-10 11:51:59 + [2025-09-09 21:41:33] iteration 2937/ 11920 | consumed samples: 3007488 | elapsed time per iteration (ms): 5665.3 | throughput per 
GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.575938E+00 | loss scale: 1.0 | grad norm: 0.569 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:11.059110 | finish at 2025-09-10 11:49:44 + [2025-09-09 21:41:39] iteration 2938/ 11920 | consumed samples: 3008512 | elapsed time per iteration (ms): 6010.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.569031E+00 | loss scale: 1.0 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:59:42.031920 | finish at 2025-09-10 12:41:21 + [2025-09-09 21:41:44] iteration 2939/ 11920 | consumed samples: 3009536 | elapsed time per iteration (ms): 5660.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.549492E+00 | loss scale: 1.0 | grad norm: 0.654 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:14.276540 | finish at 2025-09-10 11:48:59 + [2025-09-09 21:41:50] iteration 2940/ 11920 | consumed samples: 3010560 | elapsed time per iteration (ms): 5667.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.544148E+00 | loss scale: 1.0 | grad norm: 0.786 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:12.749958 | finish at 2025-09-10 11:50:03 + [2025-09-09 21:41:56] iteration 2941/ 11920 | consumed samples: 3011584 | elapsed time per iteration (ms): 5671.7 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.530271E+00 | loss scale: 1.0 | grad norm: 0.651 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:45.930993 
| finish at 2025-09-10 11:50:42 + [2025-09-09 21:42:01] iteration 2942/ 11920 | consumed samples: 3012608 | elapsed time per iteration (ms): 5907.1 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.516653E+00 | loss scale: 1.0 | grad norm: 0.737 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:54.183199 | finish at 2025-09-10 12:25:56 + [2025-09-09 21:42:07] iteration 2943/ 11920 | consumed samples: 3013632 | elapsed time per iteration (ms): 5686.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.499808E+00 | loss scale: 1.0 | grad norm: 0.500 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:45.236983 | finish at 2025-09-10 11:52:52 + [2025-09-09 21:42:13] iteration 2944/ 11920 | consumed samples: 3014656 | elapsed time per iteration (ms): 5670.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.486174E+00 | loss scale: 1.0 | grad norm: 0.592 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:14.168930 | finish at 2025-09-10 11:50:27 + [2025-09-09 21:42:19] iteration 2945/ 11920 | consumed samples: 3015680 | elapsed time per iteration (ms): 5994.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.436218E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:40.805843 | finish at 2025-09-10 12:39:00 + [2025-09-09 21:42:25] iteration 2946/ 11920 | consumed samples: 3016704 | elapsed time per iteration (ms): 5662.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.414299E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:06:56.675561 | finish at 2025-09-10 11:49:21 + [2025-09-09 21:42:30] iteration 2947/ 11920 | consumed samples: 3017728 | elapsed time per iteration (ms): 5657.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.414731E+00 | loss scale: 1.0 | grad norm: 0.625 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:06:07.496796 | finish at 2025-09-10 11:48:38 + [2025-09-09 21:42:36] iteration 2948/ 11920 | consumed samples: 3018752 | elapsed time per iteration (ms): 5681.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.426096E+00 | loss scale: 1.0 | grad norm: 0.818 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:36.212321 | finish at 2025-09-10 11:52:12 + [2025-09-09 21:42:42] iteration 2949/ 11920 | consumed samples: 3019776 | elapsed time per iteration (ms): 5677.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.384304E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:53.423659 | finish at 2025-09-10 11:51:35 + [2025-09-09 21:42:47] iteration 2950/ 11920 | consumed samples: 3020800 | elapsed time per iteration (ms): 5662.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.377423E+00 | loss scale: 1.0 | grad norm: 0.366 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:06:33.947947 | finish at 2025-09-10 11:49:21 + [2025-09-09 21:42:53] iteration 2951/ 11920 | consumed samples: 
3021824 | elapsed time per iteration (ms): 5994.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.363879E+00 | loss scale: 1.0 | grad norm: 0.425 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:56:05.833075 | finish at 2025-09-10 12:38:59 + [2025-09-09 21:42:59] iteration 2952/ 11920 | consumed samples: 3022848 | elapsed time per iteration (ms): 5676.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.339001E+00 | loss scale: 1.0 | grad norm: 0.409 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:25.518538 | finish at 2025-09-10 11:51:24 + [2025-09-09 21:43:05] iteration 2953/ 11920 | consumed samples: 3023872 | elapsed time per iteration (ms): 5659.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.329938E+00 | loss scale: 1.0 | grad norm: 0.324 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:05:50.022472 | finish at 2025-09-10 11:48:55 + [2025-09-09 21:43:10] iteration 2954/ 11920 | consumed samples: 3024896 | elapsed time per iteration (ms): 5656.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.320228E+00 | loss scale: 1.0 | grad norm: 0.380 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:05:19.790416 | finish at 2025-09-10 11:48:30 + [2025-09-09 21:43:16] iteration 2955/ 11920 | consumed samples: 3025920 | elapsed time per iteration (ms): 5670.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.307368E+00 | loss scale: 1.0 | grad norm: 0.456 | num zeros: 1.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:12.401378 | finish at 2025-09-10 11:50:28 + [2025-09-09 21:43:22] iteration 2956/ 11920 | consumed samples: 3026944 | elapsed time per iteration (ms): 5664.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.316228E+00 | loss scale: 1.0 | grad norm: 0.484 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:06:11.681694 | finish at 2025-09-10 11:49:33 + [2025-09-09 21:43:27] iteration 2957/ 11920 | consumed samples: 3027968 | elapsed time per iteration (ms): 5967.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.318828E+00 | loss scale: 1.0 | grad norm: 0.608 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:24.180648 | finish at 2025-09-10 12:34:52 + [2025-09-09 21:43:33] iteration 2958/ 11920 | consumed samples: 3028992 | elapsed time per iteration (ms): 5654.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.284092E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:04:38.064915 | finish at 2025-09-10 11:48:11 + [2025-09-09 21:43:39] iteration 2959/ 11920 | consumed samples: 3030016 | elapsed time per iteration (ms): 5653.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.281741E+00 | loss scale: 1.0 | grad norm: 0.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:04:19.110624 | finish at 2025-09-10 11:47:58 + [2025-09-09 21:43:44] iteration 2960/ 11920 | consumed samples: 3031040 | elapsed time per iteration (ms): 5655.2 | throughput per GPU (TFLOP/s/GPU): 79.8 
| MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.284293E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:04:30.453186 | finish at 2025-09-10 11:48:15 + [2025-09-09 21:43:50] iteration 2961/ 11920 | consumed samples: 3032064 | elapsed time per iteration (ms): 5873.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.259728E+00 | loss scale: 1.0 | grad norm: 0.349 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:02.483843 | finish at 2025-09-10 12:20:53 + [2025-09-09 21:43:56] iteration 2962/ 11920 | consumed samples: 3033088 | elapsed time per iteration (ms): 5663.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255540E+00 | loss scale: 1.0 | grad norm: 0.649 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:05:30.961804 | finish at 2025-09-10 11:49:27 + [2025-09-09 21:44:02] iteration 2963/ 11920 | consumed samples: 3034112 | elapsed time per iteration (ms): 5918.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.257920E+00 | loss scale: 1.0 | grad norm: 0.672 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:33.746602 | finish at 2025-09-10 12:27:36 + [2025-09-09 21:44:08] iteration 2964/ 11920 | consumed samples: 3035136 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.241869E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:46.951576 | finish at 2025-09-10 
11:47:55 + [2025-09-09 21:44:13] iteration 2965/ 11920 | consumed samples: 3036160 | elapsed time per iteration (ms): 5647.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.244564E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:50.258496 | finish at 2025-09-10 11:47:03 + [2025-09-09 21:44:19] iteration 2966/ 11920 | consumed samples: 3037184 | elapsed time per iteration (ms): 5650.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221380E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:18.142645 | finish at 2025-09-10 11:47:37 + [2025-09-09 21:44:24] iteration 2967/ 11920 | consumed samples: 3038208 | elapsed time per iteration (ms): 5650.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.213415E+00 | loss scale: 1.0 | grad norm: 0.454 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:06.013353 | finish at 2025-09-10 11:47:31 + [2025-09-09 21:44:30] iteration 2968/ 11920 | consumed samples: 3039232 | elapsed time per iteration (ms): 5652.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.218366E+00 | loss scale: 1.0 | grad norm: 0.500 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:23.057436 | finish at 2025-09-10 11:47:53 + [2025-09-09 21:44:36] iteration 2969/ 11920 | consumed samples: 3040256 | elapsed time per iteration (ms): 5644.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.212716E+00 | loss 
scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:00.633163 | finish at 2025-09-10 11:46:36 + [2025-09-09 21:44:41] iteration 2970/ 11920 | consumed samples: 3041280 | elapsed time per iteration (ms): 5647.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.196109E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:23.514287 | finish at 2025-09-10 11:47:05 + [2025-09-09 21:44:47] iteration 2971/ 11920 | consumed samples: 3042304 | elapsed time per iteration (ms): 5650.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.197966E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:44.362110 | finish at 2025-09-10 11:47:31 + [2025-09-09 21:44:53] iteration 2972/ 11920 | consumed samples: 3043328 | elapsed time per iteration (ms): 5652.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.186207E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:00.026323 | finish at 2025-09-10 11:47:53 + [2025-09-09 21:44:58] iteration 2973/ 11920 | consumed samples: 3044352 | elapsed time per iteration (ms): 5651.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.172812E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:43.471228 | finish at 2025-09-10 11:47:42 + [2025-09-09 21:45:04] iteration 2974/ 11920 | consumed samples: 3045376 | elapsed time 
per iteration (ms): 5667.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.166104E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:05:01.909477 | finish at 2025-09-10 11:50:06 + [2025-09-09 21:45:10] iteration 2975/ 11920 | consumed samples: 3046400 | elapsed time per iteration (ms): 5658.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.174330E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:35.474046 | finish at 2025-09-10 11:48:45 + [2025-09-09 21:45:15] iteration 2976/ 11920 | consumed samples: 3047424 | elapsed time per iteration (ms): 5661.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.171210E+00 | loss scale: 1.0 | grad norm: 0.399 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:53.717773 | finish at 2025-09-10 11:49:09 + [2025-09-09 21:45:21] iteration 2977/ 11920 | consumed samples: 3048448 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161256E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:01:26.413906 | finish at 2025-09-10 11:46:47 + [2025-09-09 21:45:27] iteration 2978/ 11920 | consumed samples: 3049472 | elapsed time per iteration (ms): 6028.6 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.156569E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:58:27.608819 | finish at 2025-09-10 12:43:55 + [2025-09-09 21:45:33] iteration 2979/ 11920 | consumed samples: 3050496 | elapsed time per iteration (ms): 5654.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.156761E+00 | loss scale: 1.0 | grad norm: 0.474 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:34.422454 | finish at 2025-09-10 11:48:07 + [2025-09-09 21:45:38] iteration 2980/ 11920 | consumed samples: 3051520 | elapsed time per iteration (ms): 5656.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.154675E+00 | loss scale: 1.0 | grad norm: 0.411 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:44.300194 | finish at 2025-09-10 11:48:23 + [2025-09-09 21:45:44] iteration 2981/ 11920 | consumed samples: 3052544 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.148267E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:00:53.594085 | finish at 2025-09-10 11:46:38 + [2025-09-09 21:45:50] iteration 2982/ 11920 | consumed samples: 3053568 | elapsed time per iteration (ms): 5989.7 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131065E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:15.577038 | finish at 2025-09-10 12:38:06 + [2025-09-09 21:45:56] iteration 2983/ 11920 | consumed samples: 3054592 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.139006E+00 | loss scale: 1.0 | grad norm: 0.297 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:01:39.690939 | finish at 2025-09-10 11:47:35 + [2025-09-09 21:46:01] iteration 2984/ 11920 | consumed samples: 3055616 | elapsed time per iteration (ms): 5655.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.139096E+00 | loss scale: 1.0 | grad norm: 0.396 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:21.003105 | finish at 2025-09-10 11:48:22 + [2025-09-09 21:46:07] iteration 2985/ 11920 | consumed samples: 3056640 | elapsed time per iteration (ms): 5642.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.123363E+00 | loss scale: 1.0 | grad norm: 0.409 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:00:18.704284 | finish at 2025-09-10 11:46:26 + [2025-09-09 21:46:13] iteration 2986/ 11920 | consumed samples: 3057664 | elapsed time per iteration (ms): 5996.1 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.120391E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:49.110083 | finish at 2025-09-10 12:39:02 + [2025-09-09 21:46:19] iteration 2987/ 11920 | consumed samples: 3058688 | elapsed time per iteration (ms): 5655.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131837E+00 | loss scale: 1.0 | grad norm: 0.414 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:01:58.651328 | finish at 2025-09-10 11:48:17 + [2025-09-09 
21:46:24] iteration 2988/ 11920 | consumed samples: 3059712 | elapsed time per iteration (ms): 5890.9 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.134016E+00 | loss scale: 1.0 | grad norm: 0.476 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:36:57.285755 | finish at 2025-09-10 12:23:22 + [2025-09-09 21:46:30] iteration 2989/ 11920 | consumed samples: 3060736 | elapsed time per iteration (ms): 5990.6 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.140392E+00 | loss scale: 1.0 | grad norm: 0.556 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:42.064480 | finish at 2025-09-10 12:38:13 + [2025-09-09 21:46:36] iteration 2990/ 11920 | consumed samples: 3061760 | elapsed time per iteration (ms): 5647.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.141917E+00 | loss scale: 1.0 | grad norm: 0.396 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:00:35.628705 | finish at 2025-09-10 11:47:12 + [2025-09-09 21:46:42] iteration 2991/ 11920 | consumed samples: 3062784 | elapsed time per iteration (ms): 5988.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125966E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:51:14.828012 | finish at 2025-09-10 12:37:57 + [2025-09-09 21:46:48] iteration 2992/ 11920 | consumed samples: 3063808 | elapsed time per iteration (ms): 5897.1 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125342E+00 | loss scale: 1.0 | grad norm: 
0.460 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:29.327705 | finish at 2025-09-10 12:24:17 + [2025-09-09 21:46:54] iteration 2993/ 11920 | consumed samples: 3064832 | elapsed time per iteration (ms): 5648.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.157217E+00 | loss scale: 1.0 | grad norm: 0.400 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:00:21.758400 | finish at 2025-09-10 11:47:15 + [2025-09-09 21:46:59] iteration 2994/ 11920 | consumed samples: 3065856 | elapsed time per iteration (ms): 5653.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.138496E+00 | loss scale: 1.0 | grad norm: 0.323 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:01:04.986799 | finish at 2025-09-10 11:48:04 + [2025-09-09 21:47:05] iteration 2995/ 11920 | consumed samples: 3066880 | elapsed time per iteration (ms): 6010.6 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.149279E+00 | loss scale: 1.0 | grad norm: 0.346 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:04.299388 | finish at 2025-09-10 12:41:10 + [2025-09-09 21:47:11] iteration 2996/ 11920 | consumed samples: 3067904 | elapsed time per iteration (ms): 5959.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124704E+00 | loss scale: 1.0 | grad norm: 0.338 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:23.531850 | finish at 2025-09-10 12:33:35 + [2025-09-09 21:47:17] iteration 2997/ 11920 | consumed samples: 3068928 | elapsed time per iteration (ms): 
5645.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.130721E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:59:30.534806 | finish at 2025-09-10 11:46:47 + [2025-09-09 21:47:23] iteration 2998/ 11920 | consumed samples: 3069952 | elapsed time per iteration (ms): 5647.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.116509E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:59:47.101699 | finish at 2025-09-10 11:47:10 + [2025-09-09 21:47:29] iteration 2999/ 11920 | consumed samples: 3070976 | elapsed time per iteration (ms): 5960.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124050E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:12.597463 | finish at 2025-09-10 12:33:41 + [2025-09-09 21:47:34] iteration 3000/ 11920 | consumed samples: 3072000 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.106710E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:58:29.470844 | finish at 2025-09-10 11:46:04 + [2025-09-09 21:47:40] iteration 3001/ 11920 | consumed samples: 3073024 | elapsed time per iteration (ms): 6127.7 | throughput per GPU (TFLOP/s/GPU): 73.7 | MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100597E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 15:10:52.655150 | finish at 2025-09-10 12:58:33 + [2025-09-09 21:47:46] iteration 3002/ 11920 | consumed samples: 3074048 | elapsed time per iteration (ms): 6017.6 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.095808E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:24.953078 | finish at 2025-09-10 12:42:11 + [2025-09-09 21:47:52] iteration 3003/ 11920 | consumed samples: 3075072 | elapsed time per iteration (ms): 5656.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100940E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:00:42.748872 | finish at 2025-09-10 11:48:35 + [2025-09-09 21:47:58] iteration 3004/ 11920 | consumed samples: 3076096 | elapsed time per iteration (ms): 5963.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103857E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:07.022595 | finish at 2025-09-10 12:34:05 + [2025-09-09 21:48:04] iteration 3005/ 11920 | consumed samples: 3077120 | elapsed time per iteration (ms): 5667.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.112971E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:06.617122 | finish at 2025-09-10 11:50:10 + [2025-09-09 21:48:10] iteration 3006/ 11920 | consumed samples: 3078144 | elapsed time per iteration (ms): 6023.4 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.082453E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:52.836864 | finish at 2025-09-10 12:43:02 + [2025-09-09 21:48:15] iteration 3007/ 11920 | consumed samples: 3079168 | elapsed time per iteration (ms): 5643.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090564E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:58:15.853995 | finish at 2025-09-10 11:46:31 + [2025-09-09 21:48:21] iteration 3008/ 11920 | consumed samples: 3080192 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100258E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:59.425884 | finish at 2025-09-10 11:45:20 + [2025-09-09 21:48:27] iteration 3009/ 11920 | consumed samples: 3081216 | elapsed time per iteration (ms): 5881.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084448E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:27.977241 | finish at 2025-09-10 12:21:55 + [2025-09-09 21:48:32] iteration 3010/ 11920 | consumed samples: 3082240 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086687E+00 | loss scale: 1.0 | grad norm: 0.314 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:57:27.861285 | finish at 2025-09-10 11:46:00 + [2025-09-09 21:48:38] iteration 
3011/ 11920 | consumed samples: 3083264 | elapsed time per iteration (ms): 5652.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089607E+00 | loss scale: 1.0 | grad norm: 0.459 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:59:14.555420 | finish at 2025-09-10 11:47:53 + [2025-09-09 21:48:44] iteration 3012/ 11920 | consumed samples: 3084288 | elapsed time per iteration (ms): 6279.4 | throughput per GPU (TFLOP/s/GPU): 71.9 | MFU 7.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100360E+00 | loss scale: 1.0 | grad norm: 0.412 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:32:16.949278 | finish at 2025-09-10 13:21:01 + [2025-09-09 21:48:50] iteration 3013/ 11920 | consumed samples: 3085312 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103818E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:28.642906 | finish at 2025-09-10 11:44:19 + [2025-09-09 21:48:56] iteration 3014/ 11920 | consumed samples: 3086336 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082823E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:57:08.736789 | finish at 2025-09-10 11:46:04 + [2025-09-09 21:49:01] iteration 3015/ 11920 | consumed samples: 3087360 | elapsed time per iteration (ms): 5647.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.083782E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:58:13.036648 | finish at 2025-09-10 11:47:14 + [2025-09-09 21:49:08] iteration 3016/ 11920 | consumed samples: 3088384 | elapsed time per iteration (ms): 6226.0 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077398E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:23:56.434124 | finish at 2025-09-10 13:13:04 + [2025-09-09 21:49:13] iteration 3017/ 11920 | consumed samples: 3089408 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084560E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:04.259398 | finish at 2025-09-10 11:45:17 + [2025-09-09 21:49:19] iteration 3018/ 11920 | consumed samples: 3090432 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072959E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:12.657877 | finish at 2025-09-10 11:44:31 + [2025-09-09 21:49:25] iteration 3019/ 11920 | consumed samples: 3091456 | elapsed time per iteration (ms): 5862.2 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081558E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:39.083748 | finish at 2025-09-10 12:19:04 + [2025-09-09 21:49:30] iteration 3020/ 11920 | consumed samples: 3092480 | elapsed time per iteration (ms): 5634.7 | throughput per 
GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081340E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:48.677754 | finish at 2025-09-10 11:45:19 + [2025-09-09 21:49:36] iteration 3021/ 11920 | consumed samples: 3093504 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067465E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:10.527404 | finish at 2025-09-10 11:45:46 + [2025-09-09 21:49:42] iteration 3022/ 11920 | consumed samples: 3094528 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064706E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:48.291420 | finish at 2025-09-10 11:45:30 + [2025-09-09 21:49:47] iteration 3023/ 11920 | consumed samples: 3095552 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087816E+00 | loss scale: 1.0 | grad norm: 0.319 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:54:50.885260 | finish at 2025-09-10 11:44:38 + [2025-09-09 21:49:53] iteration 3024/ 11920 | consumed samples: 3096576 | elapsed time per iteration (ms): 5930.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075879E+00 | loss scale: 1.0 | grad norm: 0.388 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:39:18.419708 
| finish at 2025-09-10 12:29:12 + [2025-09-09 21:49:59] iteration 3025/ 11920 | consumed samples: 3097600 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081140E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:54:53.736445 | finish at 2025-09-10 11:44:52 + [2025-09-09 21:50:04] iteration 3026/ 11920 | consumed samples: 3098624 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.085913E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:24.978056 | finish at 2025-09-10 11:45:29 + [2025-09-09 21:50:10] iteration 3027/ 11920 | consumed samples: 3099648 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079117E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:09.192416 | finish at 2025-09-10 11:43:19 + [2025-09-09 21:50:16] iteration 3028/ 11920 | consumed samples: 3100672 | elapsed time per iteration (ms): 5640.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073640E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:57.611990 | finish at 2025-09-10 11:46:13 + [2025-09-09 21:50:22] iteration 3029/ 11920 | consumed samples: 3101696 | elapsed time per iteration (ms): 6009.8 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.068368E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:50:33.241649 | finish at 2025-09-10 12:40:55 + [2025-09-09 21:50:27] iteration 3030/ 11920 | consumed samples: 3102720 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060806E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:14.865944 | finish at 2025-09-10 11:43:42 + [2025-09-09 21:50:33] iteration 3031/ 11920 | consumed samples: 3103744 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069435E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:08.608553 | finish at 2025-09-10 11:43:42 + [2025-09-09 21:50:39] iteration 3032/ 11920 | consumed samples: 3104768 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075737E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:54:18.595240 | finish at 2025-09-10 11:44:57 + [2025-09-09 21:50:44] iteration 3033/ 11920 | consumed samples: 3105792 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055699E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:12.930390 | finish at 2025-09-10 11:43:57 + [2025-09-09 21:50:50] iteration 3034/ 11920 | consumed samples: 
3106816 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045846E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:12.706872 | finish at 2025-09-10 11:43:02 + [2025-09-09 21:50:55] iteration 3035/ 11920 | consumed samples: 3107840 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073661E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:15.427556 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:51:01] iteration 3036/ 11920 | consumed samples: 3108864 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060648E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:38.094729 | finish at 2025-09-10 11:43:39 + [2025-09-09 21:51:07] iteration 3037/ 11920 | consumed samples: 3109888 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062997E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:50.396342 | finish at 2025-09-10 11:44:57 + [2025-09-09 21:51:12] iteration 3038/ 11920 | consumed samples: 3110912 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063602E+00 | loss scale: 1.0 | grad norm: 0.096 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:38.422959 | finish at 2025-09-10 11:43:51 + [2025-09-09 21:51:18] iteration 3039/ 11920 | consumed samples: 3111936 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061152E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:19.246947 | finish at 2025-09-10 11:43:37 + [2025-09-09 21:51:24] iteration 3040/ 11920 | consumed samples: 3112960 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061388E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:30.933895 | finish at 2025-09-10 11:44:54 + [2025-09-09 21:51:29] iteration 3041/ 11920 | consumed samples: 3113984 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042136E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:34.102241 | finish at 2025-09-10 11:44:03 + [2025-09-09 21:51:35] iteration 3042/ 11920 | consumed samples: 3115008 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055746E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:33.088976 | finish at 2025-09-10 11:43:08 + [2025-09-09 21:51:40] iteration 3043/ 11920 | consumed samples: 3116032 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.058843E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:46.873780 | finish at 2025-09-10 11:44:27 + [2025-09-09 21:51:46] iteration 3044/ 11920 | consumed samples: 3117056 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075596E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:25.043325 | finish at 2025-09-10 11:44:11 + [2025-09-09 21:51:52] iteration 3045/ 11920 | consumed samples: 3118080 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.058301E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:57.575359 | finish at 2025-09-10 11:43:49 + [2025-09-09 21:51:57] iteration 3046/ 11920 | consumed samples: 3119104 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027113E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:50.873939 | finish at 2025-09-10 11:43:48 + [2025-09-09 21:52:03] iteration 3047/ 11920 | consumed samples: 3120128 | elapsed time per iteration (ms): 6009.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.054010E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:37.830057 | finish at 2025-09-10 
12:40:41 + [2025-09-09 21:52:09] iteration 3048/ 11920 | consumed samples: 3121152 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030975E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:29.012943 | finish at 2025-09-10 11:43:38 + [2025-09-09 21:52:15] iteration 3049/ 11920 | consumed samples: 3122176 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037907E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:42.358672 | finish at 2025-09-10 11:45:57 + [2025-09-09 21:52:20] iteration 3050/ 11920 | consumed samples: 3123200 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048779E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:17.346041 | finish at 2025-09-10 11:44:38 + [2025-09-09 21:52:26] iteration 3051/ 11920 | consumed samples: 3124224 | elapsed time per iteration (ms): 5944.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046780E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:38:41.321948 | finish at 2025-09-10 12:31:07 + [2025-09-09 21:52:32] iteration 3052/ 11920 | consumed samples: 3125248 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042382E+00 | loss 
scale: 1.0 | grad norm: 0.090 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:10.436923 | finish at 2025-09-10 11:45:42 + [2025-09-09 21:52:37] iteration 3053/ 11920 | consumed samples: 3126272 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035405E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:40.665422 | finish at 2025-09-10 11:43:18 + [2025-09-09 21:52:43] iteration 3054/ 11920 | consumed samples: 3127296 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036029E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:22.468038 | finish at 2025-09-10 11:44:05 + [2025-09-09 21:52:49] iteration 3055/ 11920 | consumed samples: 3128320 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042238E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:15.772959 | finish at 2025-09-10 11:46:04 + [2025-09-09 21:52:54] iteration 3056/ 11920 | consumed samples: 3129344 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047402E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:23.994888 | finish at 2025-09-10 11:44:18 + [2025-09-09 21:53:00] iteration 3057/ 11920 | consumed samples: 3130368 | elapsed time 
per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048059E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:16.051748 | finish at 2025-09-10 11:43:16 + [2025-09-09 21:53:06] iteration 3058/ 11920 | consumed samples: 3131392 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041915E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:02.010344 | finish at 2025-09-10 11:44:08 + [2025-09-09 21:53:11] iteration 3059/ 11920 | consumed samples: 3132416 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051545E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:00.942896 | finish at 2025-09-10 11:44:12 + [2025-09-09 21:53:17] iteration 3060/ 11920 | consumed samples: 3133440 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036228E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:51.745663 | finish at 2025-09-10 11:43:09 + [2025-09-09 21:53:22] iteration 3061/ 11920 | consumed samples: 3134464 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034728E+00 | loss scale: 1.0 | grad norm: 0.322 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:50:27.409914 | finish at 2025-09-10 11:43:50 + [2025-09-09 21:53:28] iteration 3062/ 11920 | consumed samples: 3135488 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039766E+00 | loss scale: 1.0 | grad norm: 0.327 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:36.922429 | finish at 2025-09-10 11:46:05 + [2025-09-09 21:53:34] iteration 3063/ 11920 | consumed samples: 3136512 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042514E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:32.350097 | finish at 2025-09-10 11:45:06 + [2025-09-09 21:53:39] iteration 3064/ 11920 | consumed samples: 3137536 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046761E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:17.045162 | finish at 2025-09-10 11:45:56 + [2025-09-09 21:53:45] iteration 3065/ 11920 | consumed samples: 3138560 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050009E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:23.669707 | finish at 2025-09-10 11:43:09 + [2025-09-09 21:53:51] iteration 3066/ 11920 | consumed samples: 3139584 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.051303E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:52.589127 | finish at 2025-09-10 11:44:43 + [2025-09-09 21:53:56] iteration 3067/ 11920 | consumed samples: 3140608 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057027E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:58.590041 | finish at 2025-09-10 11:42:55 + [2025-09-09 21:54:02] iteration 3068/ 11920 | consumed samples: 3141632 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039532E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:57.538847 | finish at 2025-09-10 11:42:59 + [2025-09-09 21:54:07] iteration 3069/ 11920 | consumed samples: 3142656 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039285E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:05.717358 | finish at 2025-09-10 11:44:13 + [2025-09-09 21:54:13] iteration 3070/ 11920 | consumed samples: 3143680 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039055E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:46.657856 | finish at 2025-09-10 11:43:00 + [2025-09-09 
21:54:19] iteration 3071/ 11920 | consumed samples: 3144704 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023916E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:43.097427 | finish at 2025-09-10 11:45:02 + [2025-09-09 21:54:24] iteration 3072/ 11920 | consumed samples: 3145728 | elapsed time per iteration (ms): 5640.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037462E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:44.054150 | finish at 2025-09-10 11:46:08 + [2025-09-09 21:54:30] iteration 3073/ 11920 | consumed samples: 3146752 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023961E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:59.129328 | finish at 2025-09-10 11:46:29 + [2025-09-09 21:54:36] iteration 3074/ 11920 | consumed samples: 3147776 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032055E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:32.749879 | finish at 2025-09-10 11:44:08 + [2025-09-09 21:54:41] iteration 3075/ 11920 | consumed samples: 3148800 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044985E+00 | loss scale: 1.0 | grad norm: 
0.126 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:07.747457 | finish at 2025-09-10 11:44:49 + [2025-09-09 21:54:47] iteration 3076/ 11920 | consumed samples: 3149824 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035859E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:44.123820 | finish at 2025-09-10 11:44:31 + [2025-09-09 21:54:52] iteration 3077/ 11920 | consumed samples: 3150848 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033671E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:05.457062 | finish at 2025-09-10 11:43:58 + [2025-09-09 21:54:58] iteration 3078/ 11920 | consumed samples: 3151872 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018103E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:12.600401 | finish at 2025-09-10 11:44:11 + [2025-09-09 21:55:04] iteration 3079/ 11920 | consumed samples: 3152896 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033154E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:21.561845 | finish at 2025-09-10 11:43:25 + [2025-09-09 21:55:09] iteration 3080/ 11920 | consumed samples: 3153920 | elapsed time per iteration (ms): 
5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039104E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:23.527832 | finish at 2025-09-10 11:42:33 + [2025-09-09 21:55:15] iteration 3081/ 11920 | consumed samples: 3154944 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036437E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:34.204217 | finish at 2025-09-10 11:42:49 + [2025-09-09 21:55:21] iteration 3082/ 11920 | consumed samples: 3155968 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030075E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:02.614849 | finish at 2025-09-10 11:43:23 + [2025-09-09 21:55:26] iteration 3083/ 11920 | consumed samples: 3156992 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026081E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:20.001215 | finish at 2025-09-10 11:44:46 + [2025-09-09 21:55:32] iteration 3084/ 11920 | consumed samples: 3158016 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019795E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:47:57.944695 | finish at 2025-09-10 11:43:30 + [2025-09-09 21:55:37] iteration 3085/ 11920 | consumed samples: 3159040 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036002E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:33.440452 | finish at 2025-09-10 11:43:11 + [2025-09-09 21:55:43] iteration 3086/ 11920 | consumed samples: 3160064 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030702E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:10.003136 | finish at 2025-09-10 11:43:53 + [2025-09-09 21:55:49] iteration 3087/ 11920 | consumed samples: 3161088 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021311E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:12.281912 | finish at 2025-09-10 11:44:01 + [2025-09-09 21:55:54] iteration 3088/ 11920 | consumed samples: 3162112 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032657E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:54.388275 | finish at 2025-09-10 11:43:49 + [2025-09-09 21:56:00] iteration 3089/ 11920 | consumed samples: 3163136 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.024316E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:44.487986 | finish at 2025-09-10 11:42:44 + [2025-09-09 21:56:06] iteration 3090/ 11920 | consumed samples: 3164160 | elapsed time per iteration (ms): 5988.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021348E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:41:16.016076 | finish at 2025-09-10 12:37:22 + [2025-09-09 21:56:12] iteration 3091/ 11920 | consumed samples: 3165184 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039327E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:00.662988 | finish at 2025-09-10 11:43:12 + [2025-09-09 21:56:17] iteration 3092/ 11920 | consumed samples: 3166208 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013131E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:24.580988 | finish at 2025-09-10 11:43:42 + [2025-09-09 21:56:23] iteration 3093/ 11920 | consumed samples: 3167232 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030767E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:41.574512 | finish at 2025-09-10 11:46:04 + [2025-09-09 21:56:29] iteration 
3094/ 11920 | consumed samples: 3168256 | elapsed time per iteration (ms): 5953.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027652E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:48.204805 | finish at 2025-09-10 12:32:17 + [2025-09-09 21:56:34] iteration 3095/ 11920 | consumed samples: 3169280 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021913E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:04.850982 | finish at 2025-09-10 11:43:39 + [2025-09-09 21:56:40] iteration 3096/ 11920 | consumed samples: 3170304 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011707E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:17.386250 | finish at 2025-09-10 11:44:57 + [2025-09-09 21:56:46] iteration 3097/ 11920 | consumed samples: 3171328 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025983E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:19.360588 | finish at 2025-09-10 11:43:05 + [2025-09-09 21:56:52] iteration 3098/ 11920 | consumed samples: 3172352 | elapsed time per iteration (ms): 5947.4 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028838E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:34:28.020411 | finish at 2025-09-10 12:31:20 + [2025-09-09 21:56:57] iteration 3099/ 11920 | consumed samples: 3173376 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.040733E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:48.565517 | finish at 2025-09-10 11:44:46 + [2025-09-09 21:57:03] iteration 3100/ 11920 | consumed samples: 3174400 | elapsed time per iteration (ms): 5954.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013894E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:17.308073 | finish at 2025-09-10 12:32:20 + [2025-09-09 21:57:09] iteration 3101/ 11920 | consumed samples: 3175424 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033480E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:41.172535 | finish at 2025-09-10 11:42:50 + [2025-09-09 21:57:14] iteration 3102/ 11920 | consumed samples: 3176448 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028414E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:32.664219 | finish at 2025-09-10 11:42:47 + [2025-09-09 21:57:20] iteration 3103/ 11920 | consumed samples: 3177472 | elapsed time per iteration (ms): 5940.3 | throughput per 
GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018779E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:32:55.324387 | finish at 2025-09-10 12:30:16 + [2025-09-09 21:57:26] iteration 3104/ 11920 | consumed samples: 3178496 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008454E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:11.733974 | finish at 2025-09-10 11:45:38 + [2025-09-09 21:57:32] iteration 3105/ 11920 | consumed samples: 3179520 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021380E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:17.534097 | finish at 2025-09-10 11:43:49 + [2025-09-09 21:57:37] iteration 3106/ 11920 | consumed samples: 3180544 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032956E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:24.190580 | finish at 2025-09-10 11:44:01 + [2025-09-09 21:57:43] iteration 3107/ 11920 | consumed samples: 3181568 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015649E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:46.531708 
| finish at 2025-09-10 11:44:29 + [2025-09-09 21:57:49] iteration 3108/ 11920 | consumed samples: 3182592 | elapsed time per iteration (ms): 5904.7 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.037116E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:27:12.120116 | finish at 2025-09-10 12:25:01 + [2025-09-09 21:57:54] iteration 3109/ 11920 | consumed samples: 3183616 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021367E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:55.411495 | finish at 2025-09-10 11:44:50 + [2025-09-09 21:58:00] iteration 3110/ 11920 | consumed samples: 3184640 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016682E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:24.877274 | finish at 2025-09-10 11:44:25 + [2025-09-09 21:58:06] iteration 3111/ 11920 | consumed samples: 3185664 | elapsed time per iteration (ms): 6171.4 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022596E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:06:03.873719 | finish at 2025-09-10 13:04:10 + [2025-09-09 21:58:12] iteration 3112/ 11920 | consumed samples: 3186688 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.019325E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:03.236320 | finish at 2025-09-10 11:44:15 + [2025-09-09 21:58:17] iteration 3113/ 11920 | consumed samples: 3187712 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015335E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:29.911416 | finish at 2025-09-10 11:43:47 + [2025-09-09 21:58:24] iteration 3114/ 11920 | consumed samples: 3188736 | elapsed time per iteration (ms): 6042.7 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017143E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:46:51.747922 | finish at 2025-09-10 12:45:15 + [2025-09-09 21:58:29] iteration 3115/ 11920 | consumed samples: 3189760 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025514E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:33.173153 | finish at 2025-09-10 11:45:02 + [2025-09-09 21:58:35] iteration 3116/ 11920 | consumed samples: 3190784 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015617E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:08.143563 | finish at 2025-09-10 11:44:43 + [2025-09-09 21:58:40] iteration 3117/ 11920 | consumed samples: 
3191808 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027562E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:04.530324 | finish at 2025-09-10 11:44:45 + [2025-09-09 21:58:46] iteration 3118/ 11920 | consumed samples: 3192832 | elapsed time per iteration (ms): 5891.7 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015195E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:18.724771 | finish at 2025-09-10 12:23:05 + [2025-09-09 21:58:52] iteration 3119/ 11920 | consumed samples: 3193856 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014698E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:15.799763 | finish at 2025-09-10 11:44:08 + [2025-09-09 21:58:58] iteration 3120/ 11920 | consumed samples: 3194880 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025291E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:31.154442 | finish at 2025-09-10 11:44:29 + [2025-09-09 21:59:03] iteration 3121/ 11920 | consumed samples: 3195904 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027252E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:58.868117 | finish at 2025-09-10 11:43:02 + [2025-09-09 21:59:09] iteration 3122/ 11920 | consumed samples: 3196928 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014191E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:22.101946 | finish at 2025-09-10 11:44:31 + [2025-09-09 21:59:14] iteration 3123/ 11920 | consumed samples: 3197952 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019886E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:37.292219 | finish at 2025-09-10 11:43:52 + [2025-09-09 21:59:20] iteration 3124/ 11920 | consumed samples: 3198976 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016823E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:07.246805 | finish at 2025-09-10 11:43:27 + [2025-09-09 21:59:26] iteration 3125/ 11920 | consumed samples: 3200000 | elapsed time per iteration (ms): 5641.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011792E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:54.522328 | finish at 2025-09-10 11:46:20 + [2025-09-09 21:59:31] iteration 3126/ 11920 | consumed samples: 3201024 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 
| MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012720E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:36.263452 | finish at 2025-09-10 11:46:08 + [2025-09-09 21:59:37] iteration 3127/ 11920 | consumed samples: 3202048 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019633E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:08.251961 | finish at 2025-09-10 11:43:45 + [2025-09-09 21:59:43] iteration 3128/ 11920 | consumed samples: 3203072 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012649E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:35.358063 | finish at 2025-09-10 11:44:18 + [2025-09-09 21:59:49] iteration 3129/ 11920 | consumed samples: 3204096 | elapsed time per iteration (ms): 5932.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004375E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:29:13.851481 | finish at 2025-09-10 12:29:02 + [2025-09-09 21:59:54] iteration 3130/ 11920 | consumed samples: 3205120 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030750E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:22.269824 | finish at 2025-09-10 
11:43:16 + [2025-09-09 22:00:00] iteration 3131/ 11920 | consumed samples: 3206144 | elapsed time per iteration (ms): 6152.4 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023111E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:01:13.267729 | finish at 2025-09-10 13:01:14 + [2025-09-09 22:00:06] iteration 3132/ 11920 | consumed samples: 3207168 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004529E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:29.913178 | finish at 2025-09-10 11:45:36 + [2025-09-09 22:00:12] iteration 3133/ 11920 | consumed samples: 3208192 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011101E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:51.680893 | finish at 2025-09-10 11:44:03 + [2025-09-09 22:00:17] iteration 3134/ 11920 | consumed samples: 3209216 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015656E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:42.737269 | finish at 2025-09-10 11:44:00 + [2025-09-09 22:00:23] iteration 3135/ 11920 | consumed samples: 3210240 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020483E+00 | loss 
scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:28.565764 | finish at 2025-09-10 11:44:51 + [2025-09-09 22:00:28] iteration 3136/ 11920 | consumed samples: 3211264 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005956E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:48.356266 | finish at 2025-09-10 11:44:17 + [2025-09-09 22:00:34] iteration 3137/ 11920 | consumed samples: 3212288 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000164E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:41.943662 | finish at 2025-09-10 11:43:16 + [2025-09-09 22:00:40] iteration 3138/ 11920 | consumed samples: 3213312 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019636E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:50.809845 | finish at 2025-09-10 11:45:30 + [2025-09-09 22:00:45] iteration 3139/ 11920 | consumed samples: 3214336 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002144E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:40.370506 | finish at 2025-09-10 11:44:26 + [2025-09-09 22:00:51] iteration 3140/ 11920 | consumed samples: 3215360 | elapsed time 
per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004379E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:41.346812 | finish at 2025-09-10 11:44:32 + [2025-09-09 22:00:57] iteration 3141/ 11920 | consumed samples: 3216384 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007901E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:33.813121 | finish at 2025-09-10 11:43:30 + [2025-09-09 22:01:02] iteration 3142/ 11920 | consumed samples: 3217408 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010489E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:20.662539 | finish at 2025-09-10 11:45:23 + [2025-09-09 22:01:08] iteration 3143/ 11920 | consumed samples: 3218432 | elapsed time per iteration (ms): 5879.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007053E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:20:03.822469 | finish at 2025-09-10 12:21:12 + [2025-09-09 22:01:14] iteration 3144/ 11920 | consumed samples: 3219456 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009459E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:43:05.122240 | finish at 2025-09-10 11:44:19 + [2025-09-09 22:01:19] iteration 3145/ 11920 | consumed samples: 3220480 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022120E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:37.749417 | finish at 2025-09-10 11:44:57 + [2025-09-09 22:01:25] iteration 3146/ 11920 | consumed samples: 3221504 | elapsed time per iteration (ms): 6000.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005305E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:37:29.445176 | finish at 2025-09-10 12:38:55 + [2025-09-09 22:01:31] iteration 3147/ 11920 | consumed samples: 3222528 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029457E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:30.099348 | finish at 2025-09-10 11:46:01 + [2025-09-09 22:01:37] iteration 3148/ 11920 | consumed samples: 3223552 | elapsed time per iteration (ms): 6173.5 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013741E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:33.873519 | finish at 2025-09-10 13:04:11 + [2025-09-09 22:01:43] iteration 3149/ 11920 | consumed samples: 3224576 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.010130E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:06.533830 | finish at 2025-09-10 11:45:49 + [2025-09-09 22:01:49] iteration 3150/ 11920 | consumed samples: 3225600 | elapsed time per iteration (ms): 6220.1 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009663E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:09:10.343261 | finish at 2025-09-10 13:10:59 + [2025-09-09 22:01:55] iteration 3151/ 11920 | consumed samples: 3226624 | elapsed time per iteration (ms): 6255.6 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008843E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:14:15.521351 | finish at 2025-09-10 13:16:11 + [2025-09-09 22:02:01] iteration 3152/ 11920 | consumed samples: 3227648 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007858E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:19.267715 | finish at 2025-09-10 11:44:20 + [2025-09-09 22:02:07] iteration 3153/ 11920 | consumed samples: 3228672 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011294E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:41:50.635604 | finish at 2025-09-10 11:43:57 + [2025-09-09 
22:02:12] iteration 3154/ 11920 | consumed samples: 3229696 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002343E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:17.209220 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:02:18] iteration 3155/ 11920 | consumed samples: 3230720 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004976E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:44.545441 | finish at 2025-09-10 11:43:02 + [2025-09-09 22:02:23] iteration 3156/ 11920 | consumed samples: 3231744 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992919E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:25.991036 | finish at 2025-09-10 11:44:49 + [2025-09-09 22:02:29] iteration 3157/ 11920 | consumed samples: 3232768 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982396E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:00.948596 | finish at 2025-09-10 11:45:30 + [2025-09-09 22:02:35] iteration 3158/ 11920 | consumed samples: 3233792 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013424E+00 | loss scale: 1.0 | grad norm: 
0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:44.427530 | finish at 2025-09-10 11:45:19 + [2025-09-09 22:02:40] iteration 3159/ 11920 | consumed samples: 3234816 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011797E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:58.378057 | finish at 2025-09-10 11:45:39 + [2025-09-09 22:02:46] iteration 3160/ 11920 | consumed samples: 3235840 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998143E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:04.235172 | finish at 2025-09-10 11:45:50 + [2025-09-09 22:02:52] iteration 3161/ 11920 | consumed samples: 3236864 | elapsed time per iteration (ms): 5947.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002266E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:12.212378 | finish at 2025-09-10 12:31:04 + [2025-09-09 22:02:58] iteration 3162/ 11920 | consumed samples: 3237888 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991511E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:28.360681 | finish at 2025-09-10 11:45:26 + [2025-09-09 22:03:03] iteration 3163/ 11920 | consumed samples: 3238912 | elapsed time per iteration (ms): 
5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019032E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:41.365977 | finish at 2025-09-10 11:43:45 + [2025-09-09 22:03:09] iteration 3164/ 11920 | consumed samples: 3239936 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990998E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:02.965591 | finish at 2025-09-10 11:43:12 + [2025-09-09 22:03:14] iteration 3165/ 11920 | consumed samples: 3240960 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992872E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:41:03.377626 | finish at 2025-09-10 11:44:18 + [2025-09-09 22:03:20] iteration 3166/ 11920 | consumed samples: 3241984 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012738E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:41:18.465370 | finish at 2025-09-10 11:44:38 + [2025-09-09 22:03:26] iteration 3167/ 11920 | consumed samples: 3243008 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014794E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:41:10.031355 | finish at 2025-09-10 11:44:36 + [2025-09-09 22:03:32] iteration 3168/ 11920 | consumed samples: 3244032 | elapsed time per iteration (ms): 6000.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011160E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:35:19.034061 | finish at 2025-09-10 12:38:51 + [2025-09-09 22:03:37] iteration 3169/ 11920 | consumed samples: 3245056 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006753E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:43.861985 | finish at 2025-09-10 11:44:21 + [2025-09-09 22:03:43] iteration 3170/ 11920 | consumed samples: 3246080 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014405E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:16.171503 | finish at 2025-09-10 11:43:59 + [2025-09-09 22:03:49] iteration 3171/ 11920 | consumed samples: 3247104 | elapsed time per iteration (ms): 6093.7 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009716E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:33.818253 | finish at 2025-09-10 12:52:23 + [2025-09-09 22:03:55] iteration 3172/ 11920 | consumed samples: 3248128 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.997109E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:21.904824 | finish at 2025-09-10 11:43:17 + [2025-09-09 22:04:00] iteration 3173/ 11920 | consumed samples: 3249152 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009288E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:12.910783 | finish at 2025-09-10 11:43:13 + [2025-09-09 22:04:06] iteration 3174/ 11920 | consumed samples: 3250176 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014207E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:03.875808 | finish at 2025-09-10 11:43:10 + [2025-09-09 22:04:11] iteration 3175/ 11920 | consumed samples: 3251200 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005918E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:24.379392 | finish at 2025-09-10 11:43:36 + [2025-09-09 22:04:17] iteration 3176/ 11920 | consumed samples: 3252224 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003814E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:26.143599 | finish at 2025-09-10 11:43:43 + [2025-09-09 22:04:23] iteration 
3177/ 11920 | consumed samples: 3253248 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025702E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:39.941982 | finish at 2025-09-10 11:44:03 + [2025-09-09 22:04:29] iteration 3178/ 11920 | consumed samples: 3254272 | elapsed time per iteration (ms): 5924.1 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995345E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:23:08.704535 | finish at 2025-09-10 12:27:37 + [2025-09-09 22:04:34] iteration 3179/ 11920 | consumed samples: 3255296 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010766E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:55.600677 | finish at 2025-09-10 11:44:30 + [2025-09-09 22:04:40] iteration 3180/ 11920 | consumed samples: 3256320 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006265E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:49.816251 | finish at 2025-09-10 11:44:30 + [2025-09-09 22:04:46] iteration 3181/ 11920 | consumed samples: 3257344 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002076E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:03.375445 | finish at 2025-09-10 11:44:49 + [2025-09-09 22:04:51] iteration 3182/ 11920 | consumed samples: 3258368 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010172E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:43.060847 | finish at 2025-09-10 11:43:34 + [2025-09-09 22:04:57] iteration 3183/ 11920 | consumed samples: 3259392 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999480E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:18.416540 | finish at 2025-09-10 11:43:15 + [2025-09-09 22:05:02] iteration 3184/ 11920 | consumed samples: 3260416 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999629E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:51.876984 | finish at 2025-09-10 11:43:54 + [2025-09-09 22:05:08] iteration 3185/ 11920 | consumed samples: 3261440 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017613E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:53.185843 | finish at 2025-09-10 11:44:01 + [2025-09-09 22:05:14] iteration 3186/ 11920 | consumed samples: 3262464 | elapsed time per iteration (ms): 5629.7 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004193E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:29.384936 | finish at 2025-09-10 11:44:43 + [2025-09-09 22:05:19] iteration 3187/ 11920 | consumed samples: 3263488 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991799E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:10.079989 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:05:25] iteration 3188/ 11920 | consumed samples: 3264512 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996182E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:31.093129 | finish at 2025-09-10 11:45:56 + [2025-09-09 22:05:31] iteration 3189/ 11920 | consumed samples: 3265536 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007893E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:56.487121 | finish at 2025-09-10 11:45:27 + [2025-09-09 22:05:36] iteration 3190/ 11920 | consumed samples: 3266560 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007188E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:15.364308 
| finish at 2025-09-10 11:43:52 + [2025-09-09 22:05:42] iteration 3191/ 11920 | consumed samples: 3267584 | elapsed time per iteration (ms): 5640.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003038E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:33.581718 | finish at 2025-09-10 11:46:15 + [2025-09-09 22:05:47] iteration 3192/ 11920 | consumed samples: 3268608 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013470E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:18.452503 | finish at 2025-09-10 11:46:06 + [2025-09-09 22:05:53] iteration 3193/ 11920 | consumed samples: 3269632 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999701E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:00.035909 | finish at 2025-09-10 11:45:53 + [2025-09-09 22:05:59] iteration 3194/ 11920 | consumed samples: 3270656 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006736E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:09.995387 | finish at 2025-09-10 11:45:09 + [2025-09-09 22:06:04] iteration 3195/ 11920 | consumed samples: 3271680 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.006626E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:37.842299 | finish at 2025-09-10 11:44:42 + [2025-09-09 22:06:10] iteration 3196/ 11920 | consumed samples: 3272704 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998577E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:50.013076 | finish at 2025-09-10 11:45:00 + [2025-09-09 22:06:16] iteration 3197/ 11920 | consumed samples: 3273728 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005674E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:18.432751 | finish at 2025-09-10 11:44:34 + [2025-09-09 22:06:21] iteration 3198/ 11920 | consumed samples: 3274752 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014784E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:37:13.775820 | finish at 2025-09-10 11:43:35 + [2025-09-09 22:06:27] iteration 3199/ 11920 | consumed samples: 3275776 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988900E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:18.982360 | finish at 2025-09-10 11:45:46 + [2025-09-09 22:06:33] iteration 3200/ 11920 | consumed samples: 
3276800 | elapsed time per iteration (ms): 6180.8 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001719E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:58:16.883354 | finish at 2025-09-10 13:04:50 + [2025-09-09 22:06:39] iteration 3201/ 11920 | consumed samples: 3277824 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989683E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:37:40.238106 | finish at 2025-09-10 11:44:19 + [2025-09-09 22:06:44] iteration 3202/ 11920 | consumed samples: 3278848 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998612E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:37:55.558744 | finish at 2025-09-10 11:44:40 + [2025-09-09 22:06:50] iteration 3203/ 11920 | consumed samples: 3279872 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005822E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:18.929308 | finish at 2025-09-10 11:43:09 + [2025-09-09 22:06:56] iteration 3204/ 11920 | consumed samples: 3280896 | elapsed time per iteration (ms): 5823.2 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996834E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 9.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 14:05:55.443326 | finish at 2025-09-10 12:12:51 + [2025-09-09 22:07:01] iteration 3205/ 11920 | consumed samples: 3281920 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996457E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:09.023623 | finish at 2025-09-10 11:43:10 + [2025-09-09 22:07:07] iteration 3206/ 11920 | consumed samples: 3282944 | elapsed time per iteration (ms): 5890.8 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988992E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:32.412495 | finish at 2025-09-10 12:22:40 + [2025-09-09 22:07:13] iteration 3207/ 11920 | consumed samples: 3283968 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997817E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:56.196586 | finish at 2025-09-10 11:43:09 + [2025-09-09 22:07:19] iteration 3208/ 11920 | consumed samples: 3284992 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991001E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:37:09.551353 | finish at 2025-09-10 11:44:28 + [2025-09-09 22:07:24] iteration 3209/ 11920 | consumed samples: 3286016 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996578E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:37:48.719420 | finish at 2025-09-10 11:45:13 + [2025-09-09 22:07:30] iteration 3210/ 11920 | consumed samples: 3287040 | elapsed time per iteration (ms): 5838.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994030E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:32.002006 | finish at 2025-09-10 12:15:02 + [2025-09-09 22:07:36] iteration 3211/ 11920 | consumed samples: 3288064 | elapsed time per iteration (ms): 5958.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003283E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:51.026718 | finish at 2025-09-10 12:32:27 + [2025-09-09 22:07:42] iteration 3212/ 11920 | consumed samples: 3289088 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999364E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:37:45.795097 | finish at 2025-09-10 11:45:27 + [2025-09-09 22:07:47] iteration 3213/ 11920 | consumed samples: 3290112 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995092E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:23.764947 | finish at 2025-09-10 
11:44:11 + [2025-09-09 22:07:53] iteration 3214/ 11920 | consumed samples: 3291136 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993714E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:00.388008 | finish at 2025-09-10 11:43:53 + [2025-09-09 22:07:58] iteration 3215/ 11920 | consumed samples: 3292160 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990231E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:10.711888 | finish at 2025-09-10 11:44:09 + [2025-09-09 22:08:04] iteration 3216/ 11920 | consumed samples: 3293184 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006328E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:54.044189 | finish at 2025-09-10 11:43:58 + [2025-09-09 22:08:10] iteration 3217/ 11920 | consumed samples: 3294208 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009468E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:34.187744 | finish at 2025-09-10 11:43:44 + [2025-09-09 22:08:15] iteration 3218/ 11920 | consumed samples: 3295232 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012918E+00 | loss 
scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:47.209701 | finish at 2025-09-10 11:43:03 + [2025-09-09 22:08:21] iteration 3219/ 11920 | consumed samples: 3296256 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991159E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:41.859567 | finish at 2025-09-10 11:44:03 + [2025-09-09 22:08:27] iteration 3220/ 11920 | consumed samples: 3297280 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986376E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:18.202844 | finish at 2025-09-10 11:44:45 + [2025-09-09 22:08:32] iteration 3221/ 11920 | consumed samples: 3298304 | elapsed time per iteration (ms): 5899.4 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996559E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:18.735383 | finish at 2025-09-10 12:23:51 + [2025-09-09 22:08:38] iteration 3222/ 11920 | consumed samples: 3299328 | elapsed time per iteration (ms): 5831.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987375E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:05:23.492781 | finish at 2025-09-10 12:14:02 + [2025-09-09 22:08:44] iteration 3223/ 11920 | consumed samples: 3300352 | elapsed time 
per iteration (ms): 5993.3 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000204E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:43.524482 | finish at 2025-09-10 12:37:28 + [2025-09-09 22:08:50] iteration 3224/ 11920 | consumed samples: 3301376 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995778E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:04.468651 | finish at 2025-09-10 11:44:54 + [2025-09-09 22:08:56] iteration 3225/ 11920 | consumed samples: 3302400 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998687E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:30.957792 | finish at 2025-09-10 11:45:27 + [2025-09-09 22:09:01] iteration 3226/ 11920 | consumed samples: 3303424 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993261E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:01.357567 | finish at 2025-09-10 11:45:03 + [2025-09-09 22:09:07] iteration 3227/ 11920 | consumed samples: 3304448 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004641E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:35:04.123025 | finish at 2025-09-10 11:44:11 + [2025-09-09 22:09:12] iteration 3228/ 11920 | consumed samples: 3305472 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989328E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:45.284721 | finish at 2025-09-10 11:45:58 + [2025-09-09 22:09:18] iteration 3229/ 11920 | consumed samples: 3306496 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993235E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:48.430360 | finish at 2025-09-10 11:46:07 + [2025-09-09 22:09:24] iteration 3230/ 11920 | consumed samples: 3307520 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003659E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:32.680709 | finish at 2025-09-10 11:45:56 + [2025-09-09 22:09:29] iteration 3231/ 11920 | consumed samples: 3308544 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990900E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:26.477334 | finish at 2025-09-10 11:45:56 + [2025-09-09 22:09:35] iteration 3232/ 11920 | consumed samples: 3309568 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.993416E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:29.224915 | finish at 2025-09-10 11:45:04 + [2025-09-09 22:09:41] iteration 3233/ 11920 | consumed samples: 3310592 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004558E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:31.118547 | finish at 2025-09-10 11:46:12 + [2025-09-09 22:09:46] iteration 3234/ 11920 | consumed samples: 3311616 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999011E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:38.935099 | finish at 2025-09-10 11:44:25 + [2025-09-09 22:09:52] iteration 3235/ 11920 | consumed samples: 3312640 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007931E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:47.276884 | finish at 2025-09-10 11:43:39 + [2025-09-09 22:09:58] iteration 3236/ 11920 | consumed samples: 3313664 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994299E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:37.826641 | finish at 2025-09-10 11:43:35 + [2025-09-09 
22:10:03] iteration 3237/ 11920 | consumed samples: 3314688 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980657E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:12.720718 | finish at 2025-09-10 11:44:16 + [2025-09-09 22:10:09] iteration 3238/ 11920 | consumed samples: 3315712 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003459E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:02.381192 | finish at 2025-09-10 11:44:11 + [2025-09-09 22:10:14] iteration 3239/ 11920 | consumed samples: 3316736 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998828E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:37.487226 | finish at 2025-09-10 11:46:52 + [2025-09-09 22:10:20] iteration 3240/ 11920 | consumed samples: 3317760 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979245E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:32.759066 | finish at 2025-09-10 11:43:53 + [2025-09-09 22:10:26] iteration 3241/ 11920 | consumed samples: 3318784 | elapsed time per iteration (ms): 5834.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989246E+00 | loss scale: 1.0 | grad norm: 
0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:58.989771 | finish at 2025-09-10 12:14:25 + [2025-09-09 22:10:32] iteration 3242/ 11920 | consumed samples: 3319808 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986701E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:29.452694 | finish at 2025-09-10 11:44:01 + [2025-09-09 22:10:37] iteration 3243/ 11920 | consumed samples: 3320832 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983739E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:03.771769 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:10:43] iteration 3244/ 11920 | consumed samples: 3321856 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009252E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:26.128183 | finish at 2025-09-10 11:44:09 + [2025-09-09 22:10:49] iteration 3245/ 11920 | consumed samples: 3322880 | elapsed time per iteration (ms): 5956.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993113E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:15.142652 | finish at 2025-09-10 12:32:04 + [2025-09-09 22:10:54] iteration 3246/ 11920 | consumed samples: 3323904 | elapsed time per iteration (ms): 
5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993803E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:12.656269 | finish at 2025-09-10 11:44:07 + [2025-09-09 22:11:00] iteration 3247/ 11920 | consumed samples: 3324928 | elapsed time per iteration (ms): 5859.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991863E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:06:57.596355 | finish at 2025-09-10 12:17:58 + [2025-09-09 22:11:06] iteration 3248/ 11920 | consumed samples: 3325952 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001461E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:15.828522 | finish at 2025-09-10 11:43:22 + [2025-09-09 22:11:11] iteration 3249/ 11920 | consumed samples: 3326976 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968367E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:54.110382 | finish at 2025-09-10 11:44:06 + [2025-09-09 22:11:17] iteration 3250/ 11920 | consumed samples: 3328000 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989148E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:33:32.733521 | finish at 2025-09-10 11:44:50 + [2025-09-09 22:11:23] iteration 3251/ 11920 | consumed samples: 3329024 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997578E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:19.410630 | finish at 2025-09-10 11:44:42 + [2025-09-09 22:11:28] iteration 3252/ 11920 | consumed samples: 3330048 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987134E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:04.539508 | finish at 2025-09-10 11:45:33 + [2025-09-09 22:11:34] iteration 3253/ 11920 | consumed samples: 3331072 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002720E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:02.860276 | finish at 2025-09-10 11:44:37 + [2025-09-09 22:11:40] iteration 3254/ 11920 | consumed samples: 3332096 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994509E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:56.913516 | finish at 2025-09-10 11:44:37 + [2025-09-09 22:11:46] iteration 3255/ 11920 | consumed samples: 3333120 | elapsed time per iteration (ms): 6154.1 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.000537E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:48:45.011556 | finish at 2025-09-10 13:00:31 + [2025-09-09 22:11:51] iteration 3256/ 11920 | consumed samples: 3334144 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997761E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:34.600153 | finish at 2025-09-10 11:45:26 + [2025-09-09 22:11:57] iteration 3257/ 11920 | consumed samples: 3335168 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995682E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:13.489772 | finish at 2025-09-10 11:45:11 + [2025-09-09 22:12:03] iteration 3258/ 11920 | consumed samples: 3336192 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998295E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:31:27.667145 | finish at 2025-09-10 11:43:30 + [2025-09-09 22:12:08] iteration 3259/ 11920 | consumed samples: 3337216 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999029E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:27.362532 | finish at 2025-09-10 11:44:36 + [2025-09-09 22:12:14] iteration 
3260/ 11920 | consumed samples: 3338240 | elapsed time per iteration (ms): 5929.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998466E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:15:46.883817 | finish at 2025-09-10 12:28:01 + [2025-09-09 22:12:20] iteration 3261/ 11920 | consumed samples: 3339264 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966637E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:33:09.759200 | finish at 2025-09-10 11:45:30 + [2025-09-09 22:12:25] iteration 3262/ 11920 | consumed samples: 3340288 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979375E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:38.164896 | finish at 2025-09-10 11:45:04 + [2025-09-09 22:12:32] iteration 3263/ 11920 | consumed samples: 3341312 | elapsed time per iteration (ms): 6257.4 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991036E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:02:50.588685 | finish at 2025-09-10 13:15:22 + [2025-09-09 22:12:37] iteration 3264/ 11920 | consumed samples: 3342336 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972430E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:59.794937 | finish at 2025-09-10 11:43:37 + [2025-09-09 22:12:43] iteration 3265/ 11920 | consumed samples: 3343360 | elapsed time per iteration (ms): 5969.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996494E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:21:05.280197 | finish at 2025-09-10 12:33:49 + [2025-09-09 22:12:49] iteration 3266/ 11920 | consumed samples: 3344384 | elapsed time per iteration (ms): 5847.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994796E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:21.573170 | finish at 2025-09-10 12:16:11 + [2025-09-09 22:12:55] iteration 3267/ 11920 | consumed samples: 3345408 | elapsed time per iteration (ms): 6241.1 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988076E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:00:04.383343 | finish at 2025-09-10 13:13:00 + [2025-09-09 22:13:02] iteration 3268/ 11920 | consumed samples: 3346432 | elapsed time per iteration (ms): 6268.3 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996601E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:03:53.658010 | finish at 2025-09-10 13:16:55 + [2025-09-09 22:13:07] iteration 3269/ 11920 | consumed samples: 3347456 | elapsed time per iteration (ms): 5621.4 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979835E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:30.808734 | finish at 2025-09-10 11:43:38 + [2025-09-09 22:13:13] iteration 3270/ 11920 | consumed samples: 3348480 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988486E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:40.123556 | finish at 2025-09-10 11:42:53 + [2025-09-09 22:13:19] iteration 3271/ 11920 | consumed samples: 3349504 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990678E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:35.334660 | finish at 2025-09-10 11:43:54 + [2025-09-09 22:13:24] iteration 3272/ 11920 | consumed samples: 3350528 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995623E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:31:56.222273 | finish at 2025-09-10 11:45:20 + [2025-09-09 22:13:30] iteration 3273/ 11920 | consumed samples: 3351552 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979795E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:31:22.215161 
| finish at 2025-09-10 11:44:52 + [2025-09-09 22:13:35] iteration 3274/ 11920 | consumed samples: 3352576 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986642E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:26.279606 | finish at 2025-09-10 11:44:02 + [2025-09-09 22:13:41] iteration 3275/ 11920 | consumed samples: 3353600 | elapsed time per iteration (ms): 5856.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996299E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:03:45.667799 | finish at 2025-09-10 12:17:27 + [2025-09-09 22:13:47] iteration 3276/ 11920 | consumed samples: 3354624 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009950E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:46.168567 | finish at 2025-09-10 11:43:33 + [2025-09-09 22:13:53] iteration 3277/ 11920 | consumed samples: 3355648 | elapsed time per iteration (ms): 5886.9 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984320E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:00.167154 | finish at 2025-09-10 12:21:53 + [2025-09-09 22:13:58] iteration 3278/ 11920 | consumed samples: 3356672 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.993308E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:20.131145 | finish at 2025-09-10 11:43:19 + [2025-09-09 22:14:04] iteration 3279/ 11920 | consumed samples: 3357696 | elapsed time per iteration (ms): 5641.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990992E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:27.045709 | finish at 2025-09-10 11:46:31 + [2025-09-09 22:14:10] iteration 3280/ 11920 | consumed samples: 3358720 | elapsed time per iteration (ms): 5864.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998833E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:04:25.624008 | finish at 2025-09-10 12:18:36 + [2025-09-09 22:14:16] iteration 3281/ 11920 | consumed samples: 3359744 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007152E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:59.989737 | finish at 2025-09-10 11:44:16 + [2025-09-09 22:14:21] iteration 3282/ 11920 | consumed samples: 3360768 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006867E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:25.476345 | finish at 2025-09-10 11:44:47 + [2025-09-09 22:14:27] iteration 3283/ 11920 | consumed samples: 
3361792 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991113E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:31:21.755521 | finish at 2025-09-10 11:45:49 + [2025-09-09 22:14:32] iteration 3284/ 11920 | consumed samples: 3362816 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986499E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:12.986569 | finish at 2025-09-10 11:44:45 + [2025-09-09 22:14:38] iteration 3285/ 11920 | consumed samples: 3363840 | elapsed time per iteration (ms): 6003.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993822E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:24:01.068512 | finish at 2025-09-10 12:38:40 + [2025-09-09 22:14:44] iteration 3286/ 11920 | consumed samples: 3364864 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995222E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:37.125087 | finish at 2025-09-10 11:44:21 + [2025-09-09 22:14:50] iteration 3287/ 11920 | consumed samples: 3365888 | elapsed time per iteration (ms): 5828.2 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987744E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:58:34.682701 | finish at 2025-09-10 12:13:25 + [2025-09-09 22:14:56] iteration 3288/ 11920 | consumed samples: 3366912 | elapsed time per iteration (ms): 6141.6 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983026E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:34.197670 | finish at 2025-09-10 12:58:30 + [2025-09-09 22:15:02] iteration 3289/ 11920 | consumed samples: 3367936 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975926E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:34.639492 | finish at 2025-09-10 11:43:36 + [2025-09-09 22:15:07] iteration 3290/ 11920 | consumed samples: 3368960 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007091E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:41.771226 | finish at 2025-09-10 11:43:49 + [2025-09-09 22:15:13] iteration 3291/ 11920 | consumed samples: 3369984 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980550E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:55.660338 | finish at 2025-09-10 11:44:09 + [2025-09-09 22:15:19] iteration 3292/ 11920 | consumed samples: 3371008 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991255E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:51.062104 | finish at 2025-09-10 11:44:10 + [2025-09-09 22:15:24] iteration 3293/ 11920 | consumed samples: 3372032 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977078E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:50.093939 | finish at 2025-09-10 11:46:14 + [2025-09-09 22:15:30] iteration 3294/ 11920 | consumed samples: 3373056 | elapsed time per iteration (ms): 5651.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999745E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:28.639888 | finish at 2025-09-10 11:47:58 + [2025-09-09 22:15:36] iteration 3295/ 11920 | consumed samples: 3374080 | elapsed time per iteration (ms): 5925.2 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992162E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:11:44.498237 | finish at 2025-09-10 12:27:20 + [2025-09-09 22:15:41] iteration 3296/ 11920 | consumed samples: 3375104 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974401E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:08.981705 | finish at 2025-09-10 
11:45:50 + [2025-09-09 22:15:47] iteration 3297/ 11920 | consumed samples: 3376128 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996998E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:52.752359 | finish at 2025-09-10 11:44:40 + [2025-09-09 22:15:53] iteration 3298/ 11920 | consumed samples: 3377152 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993638E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:43.839147 | finish at 2025-09-10 11:44:36 + [2025-09-09 22:15:58] iteration 3299/ 11920 | consumed samples: 3378176 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983994E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:30.376027 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:16:04] iteration 3300/ 11920 | consumed samples: 3379200 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984134E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:42.392645 | finish at 2025-09-10 11:44:46 + [2025-09-09 22:16:10] iteration 3301/ 11920 | consumed samples: 3380224 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989703E+00 | loss 
scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:22.640065 | finish at 2025-09-10 11:44:32 + [2025-09-09 22:16:15] iteration 3302/ 11920 | consumed samples: 3381248 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988956E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:37.827638 | finish at 2025-09-10 11:43:53 + [2025-09-09 22:16:21] iteration 3303/ 11920 | consumed samples: 3382272 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990284E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:01.569071 | finish at 2025-09-10 11:44:22 + [2025-09-09 22:16:26] iteration 3304/ 11920 | consumed samples: 3383296 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984487E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:46.646976 | finish at 2025-09-10 11:45:13 + [2025-09-09 22:16:32] iteration 3305/ 11920 | consumed samples: 3384320 | elapsed time per iteration (ms): 5842.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979170E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:58:51.769305 | finish at 2025-09-10 12:15:24 + [2025-09-09 22:16:38] iteration 3306/ 11920 | consumed samples: 3385344 | elapsed time 
per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986222E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:54.850101 | finish at 2025-09-10 11:44:33 + [2025-09-09 22:16:44] iteration 3307/ 11920 | consumed samples: 3386368 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985292E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:11.420978 | finish at 2025-09-10 11:44:55 + [2025-09-09 22:16:49] iteration 3308/ 11920 | consumed samples: 3387392 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994141E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:42.264301 | finish at 2025-09-10 11:43:31 + [2025-09-09 22:16:55] iteration 3309/ 11920 | consumed samples: 3388416 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973879E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:37.602314 | finish at 2025-09-10 11:44:32 + [2025-09-09 22:17:00] iteration 3310/ 11920 | consumed samples: 3389440 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973361E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:27:33.496020 | finish at 2025-09-10 11:44:34 + [2025-09-09 22:17:06] iteration 3311/ 11920 | consumed samples: 3390464 | elapsed time per iteration (ms): 5822.9 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976497E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:29.443059 | finish at 2025-09-10 12:12:36 + [2025-09-09 22:17:12] iteration 3312/ 11920 | consumed samples: 3391488 | elapsed time per iteration (ms): 5831.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978055E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:37.605820 | finish at 2025-09-10 12:13:50 + [2025-09-09 22:17:18] iteration 3313/ 11920 | consumed samples: 3392512 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976535E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:47.991014 | finish at 2025-09-10 11:44:06 + [2025-09-09 22:17:23] iteration 3314/ 11920 | consumed samples: 3393536 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996335E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:22.303581 | finish at 2025-09-10 11:44:46 + [2025-09-09 22:17:29] iteration 3315/ 11920 | consumed samples: 3394560 | elapsed time per iteration (ms): 5830.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.983370E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:14.619190 | finish at 2025-09-10 12:13:44 + [2025-09-09 22:17:35] iteration 3316/ 11920 | consumed samples: 3395584 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992425E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:09.492946 | finish at 2025-09-10 11:45:44 + [2025-09-09 22:17:40] iteration 3317/ 11920 | consumed samples: 3396608 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973628E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:03.017080 | finish at 2025-09-10 11:44:43 + [2025-09-09 22:17:46] iteration 3318/ 11920 | consumed samples: 3397632 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994287E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:28:31.946177 | finish at 2025-09-10 11:46:18 + [2025-09-09 22:17:52] iteration 3319/ 11920 | consumed samples: 3398656 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989663E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:28.603656 | finish at 2025-09-10 11:45:20 + [2025-09-09 
22:17:57] iteration 3320/ 11920 | consumed samples: 3399680 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979825E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:32.633438 | finish at 2025-09-10 11:44:30 + [2025-09-09 22:18:03] iteration 3321/ 11920 | consumed samples: 3400704 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988506E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:56.643543 | finish at 2025-09-10 11:46:00 + [2025-09-09 22:18:09] iteration 3322/ 11920 | consumed samples: 3401728 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982555E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:01.607831 | finish at 2025-09-10 11:44:10 + [2025-09-09 22:18:14] iteration 3323/ 11920 | consumed samples: 3402752 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986284E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:54.223231 | finish at 2025-09-10 11:46:08 + [2025-09-09 22:18:20] iteration 3324/ 11920 | consumed samples: 3403776 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979905E+00 | loss scale: 1.0 | grad norm: 
0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:00.451851 | finish at 2025-09-10 11:44:20 + [2025-09-09 22:18:25] iteration 3325/ 11920 | consumed samples: 3404800 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979897E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:32.828482 | finish at 2025-09-10 11:44:58 + [2025-09-09 22:18:31] iteration 3326/ 11920 | consumed samples: 3405824 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985793E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:02.903507 | finish at 2025-09-10 11:44:34 + [2025-09-09 22:18:37] iteration 3327/ 11920 | consumed samples: 3406848 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995861E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:52.121381 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:18:42] iteration 3328/ 11920 | consumed samples: 3407872 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974761E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:25.548626 | finish at 2025-09-10 11:44:08 + [2025-09-09 22:18:48] iteration 3329/ 11920 | consumed samples: 3408896 | elapsed time per iteration (ms): 
5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994427E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:20.102342 | finish at 2025-09-10 11:44:08 + [2025-09-09 22:18:54] iteration 3330/ 11920 | consumed samples: 3409920 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979649E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:21.142082 | finish at 2025-09-10 11:44:15 + [2025-09-09 22:18:59] iteration 3331/ 11920 | consumed samples: 3410944 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989869E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:34.608359 | finish at 2025-09-10 11:43:34 + [2025-09-09 22:19:05] iteration 3332/ 11920 | consumed samples: 3411968 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996292E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:47.892769 | finish at 2025-09-10 11:43:53 + [2025-09-09 22:19:10] iteration 3333/ 11920 | consumed samples: 3412992 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984870E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:26:09.626310 | finish at 2025-09-10 11:45:20 + [2025-09-09 22:19:16] iteration 3334/ 11920 | consumed samples: 3414016 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007202E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:29.407694 | finish at 2025-09-10 11:45:45 + [2025-09-09 22:19:22] iteration 3335/ 11920 | consumed samples: 3415040 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976714E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:02.989861 | finish at 2025-09-10 11:44:25 + [2025-09-09 22:19:27] iteration 3336/ 11920 | consumed samples: 3416064 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979384E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:26:13.351078 | finish at 2025-09-10 11:45:41 + [2025-09-09 22:19:33] iteration 3337/ 11920 | consumed samples: 3417088 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970492E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:18.324232 | finish at 2025-09-10 11:43:51 + [2025-09-09 22:19:39] iteration 3338/ 11920 | consumed samples: 3418112 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.987265E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:25.352772 | finish at 2025-09-10 11:44:04 + [2025-09-09 22:19:44] iteration 3339/ 11920 | consumed samples: 3419136 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990532E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:49.690514 | finish at 2025-09-10 11:44:34 + [2025-09-09 22:19:50] iteration 3340/ 11920 | consumed samples: 3420160 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985401E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:52.149382 | finish at 2025-09-10 11:44:42 + [2025-09-09 22:19:55] iteration 3341/ 11920 | consumed samples: 3421184 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981481E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:13.980327 | finish at 2025-09-10 11:45:09 + [2025-09-09 22:20:01] iteration 3342/ 11920 | consumed samples: 3422208 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989879E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:51.248388 | finish at 2025-09-10 11:43:52 + [2025-09-09 22:20:07] iteration 
3343/ 11920 | consumed samples: 3423232 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980875E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:28.873767 | finish at 2025-09-10 11:43:36 + [2025-09-09 22:20:12] iteration 3344/ 11920 | consumed samples: 3424256 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979563E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:09.698883 | finish at 2025-09-10 11:43:22 + [2025-09-09 22:20:18] iteration 3345/ 11920 | consumed samples: 3425280 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972325E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:50.779036 | finish at 2025-09-10 11:45:09 + [2025-09-09 22:20:24] iteration 3346/ 11920 | consumed samples: 3426304 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991822E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:15.363451 | finish at 2025-09-10 11:44:39 + [2025-09-09 22:20:29] iteration 3347/ 11920 | consumed samples: 3427328 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992801E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:43.222103 | finish at 2025-09-10 11:46:12 + [2025-09-09 22:20:35] iteration 3348/ 11920 | consumed samples: 3428352 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970264E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:39.829845 | finish at 2025-09-10 11:44:15 + [2025-09-09 22:20:40] iteration 3349/ 11920 | consumed samples: 3429376 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979615E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:04.649641 | finish at 2025-09-10 11:43:45 + [2025-09-09 22:20:46] iteration 3350/ 11920 | consumed samples: 3430400 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986309E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:09.213405 | finish at 2025-09-10 11:43:55 + [2025-09-09 22:20:52] iteration 3351/ 11920 | consumed samples: 3431424 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974836E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:03.082601 | finish at 2025-09-10 11:42:55 + [2025-09-09 22:20:58] iteration 3352/ 11920 | consumed samples: 3432448 | elapsed time per iteration (ms): 5830.6 | throughput per 
GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979381E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:36.315439 | finish at 2025-09-10 12:13:34 + [2025-09-09 22:21:03] iteration 3353/ 11920 | consumed samples: 3433472 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977948E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:20.038180 | finish at 2025-09-10 12:28:24 + [2025-09-09 22:21:09] iteration 3354/ 11920 | consumed samples: 3434496 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986166E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:40.576095 | finish at 2025-09-10 11:43:50 + [2025-09-09 22:21:15] iteration 3355/ 11920 | consumed samples: 3435520 | elapsed time per iteration (ms): 5948.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992647E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:08.785400 | finish at 2025-09-10 12:30:24 + [2025-09-09 22:21:21] iteration 3356/ 11920 | consumed samples: 3436544 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987903E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:44.299238 
| finish at 2025-09-10 11:43:05 + [2025-09-09 22:21:26] iteration 3357/ 11920 | consumed samples: 3437568 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975037E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:39.466057 | finish at 2025-09-10 11:45:06 + [2025-09-09 22:21:32] iteration 3358/ 11920 | consumed samples: 3438592 | elapsed time per iteration (ms): 5846.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984785E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:54:19.083681 | finish at 2025-09-10 12:15:51 + [2025-09-09 22:21:38] iteration 3359/ 11920 | consumed samples: 3439616 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980407E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:31.361576 | finish at 2025-09-10 11:46:09 + [2025-09-09 22:21:43] iteration 3360/ 11920 | consumed samples: 3440640 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972801E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:52.933884 | finish at 2025-09-10 11:46:36 + [2025-09-09 22:21:49] iteration 3361/ 11920 | consumed samples: 3441664 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.976085E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:05.532358 | finish at 2025-09-10 11:44:55 + [2025-09-09 22:21:55] iteration 3362/ 11920 | consumed samples: 3442688 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975007E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:09.239760 | finish at 2025-09-10 11:44:04 + [2025-09-09 22:22:01] iteration 3363/ 11920 | consumed samples: 3443712 | elapsed time per iteration (ms): 5954.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989337E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:09:15.693937 | finish at 2025-09-10 12:31:16 + [2025-09-09 22:22:06] iteration 3364/ 11920 | consumed samples: 3444736 | elapsed time per iteration (ms): 5829.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985806E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:20.347198 | finish at 2025-09-10 12:13:27 + [2025-09-09 22:22:12] iteration 3365/ 11920 | consumed samples: 3445760 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975930E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:22.331860 | finish at 2025-09-10 11:45:34 + [2025-09-09 22:22:18] iteration 3366/ 11920 | consumed samples: 
3446784 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976880E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:21.269653 | finish at 2025-09-10 11:43:39 + [2025-09-09 22:22:23] iteration 3367/ 11920 | consumed samples: 3447808 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998727E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:38.944496 | finish at 2025-09-10 11:44:02 + [2025-09-09 22:22:29] iteration 3368/ 11920 | consumed samples: 3448832 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998926E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:22.092682 | finish at 2025-09-10 11:44:51 + [2025-09-09 22:22:35] iteration 3369/ 11920 | consumed samples: 3449856 | elapsed time per iteration (ms): 5896.0 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975174E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:00:16.854237 | finish at 2025-09-10 12:22:52 + [2025-09-09 22:22:41] iteration 3370/ 11920 | consumed samples: 3450880 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980228E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:10.756545 | finish at 2025-09-10 11:44:51 + [2025-09-09 22:22:46] iteration 3371/ 11920 | consumed samples: 3451904 | elapsed time per iteration (ms): 5927.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981004E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:04:34.821283 | finish at 2025-09-10 12:27:21 + [2025-09-09 22:22:52] iteration 3372/ 11920 | consumed samples: 3452928 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986553E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:37.088018 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:22:58] iteration 3373/ 11920 | consumed samples: 3453952 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960791E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:52.185456 | finish at 2025-09-10 11:43:50 + [2025-09-09 22:23:04] iteration 3374/ 11920 | consumed samples: 3454976 | elapsed time per iteration (ms): 6629.5 | throughput per GPU (TFLOP/s/GPU): 68.1 | MFU 6.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990786E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 15:44:15.783676 | finish at 2025-09-10 14:07:20 + [2025-09-09 22:23:10] iteration 3375/ 11920 | consumed samples: 3456000 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969258E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:22.350992 | finish at 2025-09-10 11:43:32 + [2025-09-09 22:23:16] iteration 3376/ 11920 | consumed samples: 3457024 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973172E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:51.567398 | finish at 2025-09-10 11:43:07 + [2025-09-09 22:23:21] iteration 3377/ 11920 | consumed samples: 3458048 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981588E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:27.578729 | finish at 2025-09-10 11:43:49 + [2025-09-09 22:23:27] iteration 3378/ 11920 | consumed samples: 3459072 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005565E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:11.926184 | finish at 2025-09-10 11:44:39 + [2025-09-09 22:23:32] iteration 3379/ 11920 | consumed samples: 3460096 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984697E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:04.568142 | finish at 2025-09-10 
11:45:37 + [2025-09-09 22:23:38] iteration 3380/ 11920 | consumed samples: 3461120 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985768E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:10.077515 | finish at 2025-09-10 11:44:48 + [2025-09-09 22:23:44] iteration 3381/ 11920 | consumed samples: 3462144 | elapsed time per iteration (ms): 5921.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975013E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:42.047323 | finish at 2025-09-10 12:26:26 + [2025-09-09 22:23:50] iteration 3382/ 11920 | consumed samples: 3463168 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984166E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:09.376599 | finish at 2025-09-10 11:44:59 + [2025-09-09 22:23:55] iteration 3383/ 11920 | consumed samples: 3464192 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988636E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:15.497879 | finish at 2025-09-10 11:44:11 + [2025-09-09 22:24:01] iteration 3384/ 11920 | consumed samples: 3465216 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992640E+00 | loss 
scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:31.036911 | finish at 2025-09-10 11:44:32 + [2025-09-09 22:24:07] iteration 3385/ 11920 | consumed samples: 3466240 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969821E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:19.862888 | finish at 2025-09-10 11:44:26 + [2025-09-09 22:24:12] iteration 3386/ 11920 | consumed samples: 3467264 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972666E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:39.979019 | finish at 2025-09-10 11:43:52 + [2025-09-09 22:24:18] iteration 3387/ 11920 | consumed samples: 3468288 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968701E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:23.709938 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:24:23] iteration 3388/ 11920 | consumed samples: 3469312 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988099E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:40.189098 | finish at 2025-09-10 11:44:04 + [2025-09-09 22:24:29] iteration 3389/ 11920 | consumed samples: 3470336 | elapsed time 
per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993997E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:03.512161 | finish at 2025-09-10 11:45:33 + [2025-09-09 22:24:35] iteration 3390/ 11920 | consumed samples: 3471360 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971712E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:36.703184 | finish at 2025-09-10 11:45:11 + [2025-09-09 22:24:41] iteration 3391/ 11920 | consumed samples: 3472384 | elapsed time per iteration (ms): 5860.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990515E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:07.475153 | finish at 2025-09-10 12:17:48 + [2025-09-09 22:24:46] iteration 3392/ 11920 | consumed samples: 3473408 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975519E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:28.704247 | finish at 2025-09-10 11:46:15 + [2025-09-09 22:24:52] iteration 3393/ 11920 | consumed samples: 3474432 | elapsed time per iteration (ms): 5936.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971308E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 14:03:38.332351 | finish at 2025-09-10 12:28:30 + [2025-09-09 22:24:58] iteration 3394/ 11920 | consumed samples: 3475456 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970821E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:46.587722 | finish at 2025-09-10 11:43:44 + [2025-09-09 22:25:03] iteration 3395/ 11920 | consumed samples: 3476480 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969921E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:29.141302 | finish at 2025-09-10 11:43:32 + [2025-09-09 22:25:09] iteration 3396/ 11920 | consumed samples: 3477504 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962972E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:39.976832 | finish at 2025-09-10 11:43:49 + [2025-09-09 22:25:15] iteration 3397/ 11920 | consumed samples: 3478528 | elapsed time per iteration (ms): 5847.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976268E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:38.082700 | finish at 2025-09-10 12:15:53 + [2025-09-09 22:25:20] iteration 3398/ 11920 | consumed samples: 3479552 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.971208E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:58.076591 | finish at 2025-09-10 11:44:18 + [2025-09-09 22:25:26] iteration 3399/ 11920 | consumed samples: 3480576 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972379E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:31.881784 | finish at 2025-09-10 11:43:58 + [2025-09-09 22:25:32] iteration 3400/ 11920 | consumed samples: 3481600 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962782E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:55.504589 | finish at 2025-09-10 11:46:27 + [2025-09-09 22:25:37] iteration 3401/ 11920 | consumed samples: 3482624 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983873E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:59.933671 | finish at 2025-09-10 11:44:37 + [2025-09-09 22:25:43] iteration 3402/ 11920 | consumed samples: 3483648 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971776E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:09.796994 | finish at 2025-09-10 11:45:53 + [2025-09-09 
22:25:49] iteration 3403/ 11920 | consumed samples: 3484672 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969450E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:56.082948 | finish at 2025-09-10 11:45:45 + [2025-09-09 22:25:54] iteration 3404/ 11920 | consumed samples: 3485696 | elapsed time per iteration (ms): 5891.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967880E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:12.755054 | finish at 2025-09-10 12:22:07 + [2025-09-09 22:26:00] iteration 3405/ 11920 | consumed samples: 3486720 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971963E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:20.911523 | finish at 2025-09-10 11:45:21 + [2025-09-09 22:26:06] iteration 3406/ 11920 | consumed samples: 3487744 | elapsed time per iteration (ms): 5980.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986201E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:39.586561 | finish at 2025-09-10 12:34:46 + [2025-09-09 22:26:12] iteration 3407/ 11920 | consumed samples: 3488768 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951061E+00 | loss scale: 1.0 | grad norm: 
0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:43.781827 | finish at 2025-09-10 11:43:55 + [2025-09-09 22:26:17] iteration 3408/ 11920 | consumed samples: 3489792 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970712E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:11.271622 | finish at 2025-09-10 11:43:29 + [2025-09-09 22:26:23] iteration 3409/ 11920 | consumed samples: 3490816 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984969E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:12.237037 | finish at 2025-09-10 11:43:35 + [2025-09-09 22:26:29] iteration 3410/ 11920 | consumed samples: 3491840 | elapsed time per iteration (ms): 5854.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989669E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:18.792636 | finish at 2025-09-10 12:16:48 + [2025-09-09 22:26:34] iteration 3411/ 11920 | consumed samples: 3492864 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993549E+00 | loss scale: 1.0 | grad norm: 0.394 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:45.003580 | finish at 2025-09-10 11:44:19 + [2025-09-09 22:26:40] iteration 3412/ 11920 | consumed samples: 3493888 | elapsed time per iteration (ms): 
5641.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005089E+00 | loss scale: 1.0 | grad norm: 0.352 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:58.545276 | finish at 2025-09-10 11:46:39 + [2025-09-09 22:26:46] iteration 3413/ 11920 | consumed samples: 3494912 | elapsed time per iteration (ms): 5643.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013929E+00 | loss scale: 1.0 | grad norm: 0.328 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:07.229064 | finish at 2025-09-10 11:46:53 + [2025-09-09 22:26:52] iteration 3414/ 11920 | consumed samples: 3495936 | elapsed time per iteration (ms): 5999.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015604E+00 | loss scale: 1.0 | grad norm: 0.380 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:10:32.761302 | finish at 2025-09-10 12:37:24 + [2025-09-09 22:26:57] iteration 3415/ 11920 | consumed samples: 3496960 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012847E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:00.653003 | finish at 2025-09-10 11:46:58 + [2025-09-09 22:27:03] iteration 3416/ 11920 | consumed samples: 3497984 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015565E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:18:39.843250 | finish at 2025-09-10 11:45:43 + [2025-09-09 22:27:09] iteration 3417/ 11920 | consumed samples: 3499008 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009940E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:27.580294 | finish at 2025-09-10 11:46:36 + [2025-09-09 22:27:14] iteration 3418/ 11920 | consumed samples: 3500032 | elapsed time per iteration (ms): 5648.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.020828E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:20.785887 | finish at 2025-09-10 11:47:35 + [2025-09-09 22:27:20] iteration 3419/ 11920 | consumed samples: 3501056 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004461E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:55.555553 | finish at 2025-09-10 11:46:15 + [2025-09-09 22:27:26] iteration 3420/ 11920 | consumed samples: 3502080 | elapsed time per iteration (ms): 5643.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008910E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:27.245936 | finish at 2025-09-10 11:46:53 + [2025-09-09 22:27:31] iteration 3421/ 11920 | consumed samples: 3503104 | elapsed time per iteration (ms): 5867.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.991150E+00 | loss scale: 1.0 | grad norm: 0.344 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:51:07.125224 | finish at 2025-09-10 12:18:39 + [2025-09-09 22:27:37] iteration 3422/ 11920 | consumed samples: 3504128 | elapsed time per iteration (ms): 5664.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.508998E+00 | loss scale: 1.0 | grad norm: 4.474 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:12.423084 | finish at 2025-09-10 11:49:50 + [2025-09-09 22:27:43] iteration 3423/ 11920 | consumed samples: 3505152 | elapsed time per iteration (ms): 5676.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.171930E+00 | loss scale: 1.0 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:56.185005 | finish at 2025-09-10 11:51:39 + [2025-09-09 22:27:49] iteration 3424/ 11920 | consumed samples: 3506176 | elapsed time per iteration (ms): 5908.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.145271E+00 | loss scale: 1.0 | grad norm: 0.561 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:34.029659 | finish at 2025-09-10 12:24:23 + [2025-09-09 22:27:54] iteration 3425/ 11920 | consumed samples: 3507200 | elapsed time per iteration (ms): 5674.7 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.147542E+00 | loss scale: 1.0 | grad norm: 0.521 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:26.428833 | finish at 2025-09-10 11:51:21 + [2025-09-09 22:28:00] iteration 
3426/ 11920 | consumed samples: 3508224 | elapsed time per iteration (ms): 6132.7 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.180346E+00 | loss scale: 1.0 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:28:10.928726 | finish at 2025-09-10 12:56:11 + [2025-09-09 22:28:07] iteration 3427/ 11920 | consumed samples: 3509248 | elapsed time per iteration (ms): 6052.3 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181592E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:16:42.163450 | finish at 2025-09-10 12:44:49 + [2025-09-09 22:28:12] iteration 3428/ 11920 | consumed samples: 3510272 | elapsed time per iteration (ms): 5911.6 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176269E+00 | loss scale: 1.0 | grad norm: 0.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:56:41.522772 | finish at 2025-09-10 12:24:54 + [2025-09-09 22:28:18] iteration 3429/ 11920 | consumed samples: 3511296 | elapsed time per iteration (ms): 5918.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.148377E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:57:37.471109 | finish at 2025-09-10 12:25:56 + [2025-09-09 22:28:24] iteration 3430/ 11920 | consumed samples: 3512320 | elapsed time per iteration (ms): 5658.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164298E+00 | loss scale: 1.0 | grad norm: 0.380 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:43.168530 | finish at 2025-09-10 11:49:07 + [2025-09-09 22:28:30] iteration 3431/ 11920 | consumed samples: 3513344 | elapsed time per iteration (ms): 5677.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.146291E+00 | loss scale: 1.0 | grad norm: 0.457 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:17.732551 | finish at 2025-09-10 11:51:47 + [2025-09-09 22:28:35] iteration 3432/ 11920 | consumed samples: 3514368 | elapsed time per iteration (ms): 5674.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.170412E+00 | loss scale: 1.0 | grad norm: 0.614 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:43.478258 | finish at 2025-09-10 11:51:19 + [2025-09-09 22:28:41] iteration 3433/ 11920 | consumed samples: 3515392 | elapsed time per iteration (ms): 5686.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.177690E+00 | loss scale: 1.0 | grad norm: 0.495 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:22.074795 | finish at 2025-09-10 11:53:03 + [2025-09-09 22:28:47] iteration 3434/ 11920 | consumed samples: 3516416 | elapsed time per iteration (ms): 5662.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.147982E+00 | loss scale: 1.0 | grad norm: 0.417 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:49.123474 | finish at 2025-09-10 11:49:36 + [2025-09-09 22:28:52] iteration 3435/ 11920 | consumed samples: 3517440 | elapsed time per iteration (ms): 5649.6 | throughput per 
GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.142723E+00 | loss scale: 1.0 | grad norm: 0.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:56.496155 | finish at 2025-09-10 11:47:49 + [2025-09-09 22:28:58] iteration 3436/ 11920 | consumed samples: 3518464 | elapsed time per iteration (ms): 5653.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.131630E+00 | loss scale: 1.0 | grad norm: 0.456 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:26.893905 | finish at 2025-09-10 11:48:25 + [2025-09-09 22:29:04] iteration 3437/ 11920 | consumed samples: 3519488 | elapsed time per iteration (ms): 5653.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124415E+00 | loss scale: 1.0 | grad norm: 0.360 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:15.492140 | finish at 2025-09-10 11:48:19 + [2025-09-09 22:29:09] iteration 3438/ 11920 | consumed samples: 3520512 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.113583E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:06.060130 | finish at 2025-09-10 11:46:15 + [2025-09-09 22:29:15] iteration 3439/ 11920 | consumed samples: 3521536 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109967E+00 | loss scale: 1.0 | grad norm: 0.362 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:44.152357 
| finish at 2025-09-10 11:45:59 + [2025-09-09 22:29:21] iteration 3440/ 11920 | consumed samples: 3522560 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109846E+00 | loss scale: 1.0 | grad norm: 0.589 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:28.122368 | finish at 2025-09-10 11:46:49 + [2025-09-09 22:29:26] iteration 3441/ 11920 | consumed samples: 3523584 | elapsed time per iteration (ms): 5654.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.126928E+00 | loss scale: 1.0 | grad norm: 0.842 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:05.544642 | finish at 2025-09-10 11:48:32 + [2025-09-09 22:29:32] iteration 3442/ 11920 | consumed samples: 3524608 | elapsed time per iteration (ms): 5652.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.126456E+00 | loss scale: 1.0 | grad norm: 0.599 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:38.955284 | finish at 2025-09-10 11:48:11 + [2025-09-09 22:29:38] iteration 3443/ 11920 | consumed samples: 3525632 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.121556E+00 | loss scale: 1.0 | grad norm: 0.621 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:05.375926 | finish at 2025-09-10 11:47:43 + [2025-09-09 22:29:44] iteration 3444/ 11920 | consumed samples: 3526656 | elapsed time per iteration (ms): 6040.1 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.113928E+00 | loss scale: 1.0 | grad norm: 0.382 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:15.692407 | finish at 2025-09-10 12:42:59 + [2025-09-09 22:29:49] iteration 3445/ 11920 | consumed samples: 3527680 | elapsed time per iteration (ms): 5653.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.112343E+00 | loss scale: 1.0 | grad norm: 0.670 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:18:29.964019 | finish at 2025-09-10 11:48:19 + [2025-09-09 22:29:55] iteration 3446/ 11920 | consumed samples: 3528704 | elapsed time per iteration (ms): 5664.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124280E+00 | loss scale: 1.0 | grad norm: 0.702 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:59.677934 | finish at 2025-09-10 11:49:55 + [2025-09-09 22:30:01] iteration 3447/ 11920 | consumed samples: 3529728 | elapsed time per iteration (ms): 5900.9 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125063E+00 | loss scale: 1.0 | grad norm: 0.881 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:18.165154 | finish at 2025-09-10 12:23:19 + [2025-09-09 22:30:06] iteration 3448/ 11920 | consumed samples: 3530752 | elapsed time per iteration (ms): 5668.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099930E+00 | loss scale: 1.0 | grad norm: 0.440 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:21.055172 | finish at 2025-09-10 11:50:28 + [2025-09-09 22:30:12] iteration 3449/ 11920 | consumed samples: 
3531776 | elapsed time per iteration (ms): 5854.9 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100545E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:36.590318 | finish at 2025-09-10 12:16:49 + [2025-09-09 22:30:18] iteration 3450/ 11920 | consumed samples: 3532800 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090003E+00 | loss scale: 1.0 | grad norm: 0.363 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:43.180034 | finish at 2025-09-10 11:47:01 + [2025-09-09 22:30:24] iteration 3451/ 11920 | consumed samples: 3533824 | elapsed time per iteration (ms): 5644.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.083027E+00 | loss scale: 1.0 | grad norm: 0.363 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:45.249429 | finish at 2025-09-10 11:47:09 + [2025-09-09 22:30:29] iteration 3452/ 11920 | consumed samples: 3534848 | elapsed time per iteration (ms): 5646.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.083301E+00 | loss scale: 1.0 | grad norm: 0.443 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:56.680793 | finish at 2025-09-10 11:47:26 + [2025-09-09 22:30:35] iteration 3453/ 11920 | consumed samples: 3535872 | elapsed time per iteration (ms): 5641.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.091220E+00 | loss scale: 1.0 | grad norm: 0.756 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:06.653141 | finish at 2025-09-10 11:46:42 + [2025-09-09 22:30:41] iteration 3454/ 11920 | consumed samples: 3536896 | elapsed time per iteration (ms): 5961.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087123E+00 | loss scale: 1.0 | grad norm: 0.445 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:01:11.889138 | finish at 2025-09-10 12:31:53 + [2025-09-09 22:30:47] iteration 3455/ 11920 | consumed samples: 3537920 | elapsed time per iteration (ms): 5644.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079121E+00 | loss scale: 1.0 | grad norm: 0.720 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:16.872168 | finish at 2025-09-10 11:47:03 + [2025-09-09 22:30:52] iteration 3456/ 11920 | consumed samples: 3538944 | elapsed time per iteration (ms): 5641.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069956E+00 | loss scale: 1.0 | grad norm: 0.600 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:15:52.543690 | finish at 2025-09-10 11:46:45 + [2025-09-09 22:30:58] iteration 3457/ 11920 | consumed samples: 3539968 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069321E+00 | loss scale: 1.0 | grad norm: 0.339 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:08.428240 | finish at 2025-09-10 11:45:06 + [2025-09-09 22:31:03] iteration 3458/ 11920 | consumed samples: 3540992 | elapsed time per iteration (ms): 5648.6 | throughput per GPU (TFLOP/s/GPU): 79.9 
| MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059329E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:38.256344 | finish at 2025-09-10 11:47:42 + [2025-09-09 22:31:09] iteration 3459/ 11920 | consumed samples: 3542016 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.061677E+00 | loss scale: 1.0 | grad norm: 0.443 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:23.442635 | finish at 2025-09-10 11:45:33 + [2025-09-09 22:31:15] iteration 3460/ 11920 | consumed samples: 3543040 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060934E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:10.557804 | finish at 2025-09-10 11:44:25 + [2025-09-09 22:31:20] iteration 3461/ 11920 | consumed samples: 3544064 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042174E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:53.552329 | finish at 2025-09-10 11:46:14 + [2025-09-09 22:31:26] iteration 3462/ 11920 | consumed samples: 3545088 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048206E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:59.807445 | finish at 2025-09-10 
11:45:26 + [2025-09-09 22:31:32] iteration 3463/ 11920 | consumed samples: 3546112 | elapsed time per iteration (ms): 5648.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053714E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:08.533492 | finish at 2025-09-10 11:47:40 + [2025-09-09 22:31:37] iteration 3464/ 11920 | consumed samples: 3547136 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039052E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:44.191927 | finish at 2025-09-10 11:46:21 + [2025-09-09 22:31:43] iteration 3465/ 11920 | consumed samples: 3548160 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035799E+00 | loss scale: 1.0 | grad norm: 0.385 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:23.960245 | finish at 2025-09-10 11:46:07 + [2025-09-09 22:31:49] iteration 3466/ 11920 | consumed samples: 3549184 | elapsed time per iteration (ms): 6350.2 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.045465E+00 | loss scale: 1.0 | grad norm: 0.456 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:54:44.414918 | finish at 2025-09-10 13:26:34 + [2025-09-09 22:31:55] iteration 3467/ 11920 | consumed samples: 3550208 | elapsed time per iteration (ms): 5649.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.044425E+00 | loss 
scale: 1.0 | grad norm: 0.276 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:15:55.881617 | finish at 2025-09-10 11:47:51 + [2025-09-09 22:32:01] iteration 3468/ 11920 | consumed samples: 3551232 | elapsed time per iteration (ms): 5647.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036518E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:15:28.503067 | finish at 2025-09-10 11:47:29 + [2025-09-09 22:32:06] iteration 3469/ 11920 | consumed samples: 3552256 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028028E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:59.563125 | finish at 2025-09-10 11:46:06 + [2025-09-09 22:32:12] iteration 3470/ 11920 | consumed samples: 3553280 | elapsed time per iteration (ms): 5833.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026618E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:41:33.094373 | finish at 2025-09-10 12:13:45 + [2025-09-09 22:32:18] iteration 3471/ 11920 | consumed samples: 3554304 | elapsed time per iteration (ms): 5644.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018556E+00 | loss scale: 1.0 | grad norm: 0.310 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:49.496324 | finish at 2025-09-10 11:47:07 + [2025-09-09 22:32:23] iteration 3472/ 11920 | consumed samples: 3555328 | elapsed time 
per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033982E+00 | loss scale: 1.0 | grad norm: 0.461 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:26.254395 | finish at 2025-09-10 11:45:50 + [2025-09-09 22:32:29] iteration 3473/ 11920 | consumed samples: 3556352 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021973E+00 | loss scale: 1.0 | grad norm: 0.359 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:29.338695 | finish at 2025-09-10 11:44:58 + [2025-09-09 22:32:35] iteration 3474/ 11920 | consumed samples: 3557376 | elapsed time per iteration (ms): 5964.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026807E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:59:39.124682 | finish at 2025-09-10 12:32:14 + [2025-09-09 22:32:41] iteration 3475/ 11920 | consumed samples: 3558400 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039783E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:14.613923 | finish at 2025-09-10 11:45:55 + [2025-09-09 22:32:46] iteration 3476/ 11920 | consumed samples: 3559424 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028293E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:12:35.824605 | finish at 2025-09-10 11:45:22 + [2025-09-09 22:32:52] iteration 3477/ 11920 | consumed samples: 3560448 | elapsed time per iteration (ms): 5980.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022216E+00 | loss scale: 1.0 | grad norm: 0.794 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:01:29.047565 | finish at 2025-09-10 12:34:21 + [2025-09-09 22:32:58] iteration 3478/ 11920 | consumed samples: 3561472 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026198E+00 | loss scale: 1.0 | grad norm: 0.415 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:49.233360 | finish at 2025-09-10 11:44:47 + [2025-09-09 22:33:03] iteration 3479/ 11920 | consumed samples: 3562496 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026965E+00 | loss scale: 1.0 | grad norm: 0.321 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:11.468578 | finish at 2025-09-10 11:45:15 + [2025-09-09 22:33:09] iteration 3480/ 11920 | consumed samples: 3563520 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018637E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:06.696787 | finish at 2025-09-10 11:45:16 + [2025-09-09 22:33:15] iteration 3481/ 11920 | consumed samples: 3564544 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.002572E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:45.921229 | finish at 2025-09-10 11:45:01 + [2025-09-09 22:33:20] iteration 3482/ 11920 | consumed samples: 3565568 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999292E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:54.126089 | finish at 2025-09-10 11:46:14 + [2025-09-09 22:33:26] iteration 3483/ 11920 | consumed samples: 3566592 | elapsed time per iteration (ms): 5642.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010778E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:26.405491 | finish at 2025-09-10 11:46:52 + [2025-09-09 22:33:32] iteration 3484/ 11920 | consumed samples: 3567616 | elapsed time per iteration (ms): 5642.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.005463E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:19.344950 | finish at 2025-09-10 11:46:51 + [2025-09-09 22:33:37] iteration 3485/ 11920 | consumed samples: 3568640 | elapsed time per iteration (ms): 5646.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008066E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:48.640701 | finish at 2025-09-10 11:47:26 + [2025-09-09 
22:33:43] iteration 3486/ 11920 | consumed samples: 3569664 | elapsed time per iteration (ms): 5990.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011511E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:02:02.328442 | finish at 2025-09-10 12:35:46 + [2025-09-09 22:33:49] iteration 3487/ 11920 | consumed samples: 3570688 | elapsed time per iteration (ms): 5891.0 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014848E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:47:59.038512 | finish at 2025-09-10 12:21:48 + [2025-09-09 22:33:55] iteration 3488/ 11920 | consumed samples: 3571712 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006958E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:55.110756 | finish at 2025-09-10 11:46:50 + [2025-09-09 22:34:00] iteration 3489/ 11920 | consumed samples: 3572736 | elapsed time per iteration (ms): 5653.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997751E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:20.413829 | finish at 2025-09-10 11:48:21 + [2025-09-09 22:34:06] iteration 3490/ 11920 | consumed samples: 3573760 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995900E+00 | loss scale: 1.0 | grad norm: 
0.249 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:37.776024 | finish at 2025-09-10 11:45:44 + [2025-09-09 22:34:12] iteration 3491/ 11920 | consumed samples: 3574784 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003452E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:48.797946 | finish at 2025-09-10 11:45:00 + [2025-09-09 22:34:17] iteration 3492/ 11920 | consumed samples: 3575808 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995172E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:45.297098 | finish at 2025-09-10 11:46:03 + [2025-09-09 22:34:23] iteration 3493/ 11920 | consumed samples: 3576832 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998827E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:59.934861 | finish at 2025-09-10 11:46:23 + [2025-09-09 22:34:29] iteration 3494/ 11920 | consumed samples: 3577856 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000707E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:45.502831 | finish at 2025-09-10 11:46:14 + [2025-09-09 22:34:34] iteration 3495/ 11920 | consumed samples: 3578880 | elapsed time per iteration (ms): 
5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009180E+00 | loss scale: 1.0 | grad norm: 0.362 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:03.100058 | finish at 2025-09-10 11:45:37 + [2025-09-09 22:34:40] iteration 3496/ 11920 | consumed samples: 3579904 | elapsed time per iteration (ms): 5859.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994915E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:42.986172 | finish at 2025-09-10 12:17:23 + [2025-09-09 22:34:46] iteration 3497/ 11920 | consumed samples: 3580928 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001835E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:18.091054 | finish at 2025-09-10 11:47:04 + [2025-09-09 22:34:51] iteration 3498/ 11920 | consumed samples: 3581952 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008135E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:12:26.649521 | finish at 2025-09-10 11:47:18 + [2025-09-09 22:34:57] iteration 3499/ 11920 | consumed samples: 3582976 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986017E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:10:12.349111 | finish at 2025-09-10 11:45:09 + [2025-09-09 22:35:03] iteration 3500/ 11920 | consumed samples: 3584000 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011095E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:11.303954 | finish at 2025-09-10 11:45:14 + [2025-09-09 22:35:08] iteration 3501/ 11920 | consumed samples: 3585024 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989370E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:46.096488 | finish at 2025-09-10 11:44:54 + [2025-09-09 22:35:14] iteration 3502/ 11920 | consumed samples: 3586048 | elapsed time per iteration (ms): 5837.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000705E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:57.161628 | finish at 2025-09-10 12:14:11 + [2025-09-09 22:35:20] iteration 3503/ 11920 | consumed samples: 3587072 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997226E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:10.190792 | finish at 2025-09-10 11:45:30 + [2025-09-09 22:35:26] iteration 3504/ 11920 | consumed samples: 3588096 | elapsed time per iteration (ms): 5842.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.981352E+00 | loss scale: 1.0 | grad norm: 0.100 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:32.313728 | finish at 2025-09-10 12:14:58 + [2025-09-09 22:35:31] iteration 3505/ 11920 | consumed samples: 3589120 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987972E+00 | loss scale: 1.0 | grad norm: 0.095 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:53.163557 | finish at 2025-09-10 11:46:24 + [2025-09-09 22:35:37] iteration 3506/ 11920 | consumed samples: 3590144 | elapsed time per iteration (ms): 5971.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985298E+00 | loss scale: 1.0 | grad norm: 0.099 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:57:24.958614 | finish at 2025-09-10 12:33:02 + [2025-09-09 22:35:43] iteration 3507/ 11920 | consumed samples: 3591168 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983384E+00 | loss scale: 1.0 | grad norm: 0.099 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:44.181983 | finish at 2025-09-10 11:46:27 + [2025-09-09 22:35:48] iteration 3508/ 11920 | consumed samples: 3592192 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989480E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:54.435943 | finish at 2025-09-10 11:45:43 + [2025-09-09 22:35:54] iteration 
3509/ 11920 | consumed samples: 3593216 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991389E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:18.833749 | finish at 2025-09-10 11:46:13 + [2025-09-09 22:36:00] iteration 3510/ 11920 | consumed samples: 3594240 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983852E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:08:49.206393 | finish at 2025-09-10 11:44:49 + [2025-09-09 22:36:05] iteration 3511/ 11920 | consumed samples: 3595264 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990315E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:59.833583 | finish at 2025-09-10 11:46:05 + [2025-09-09 22:36:11] iteration 3512/ 11920 | consumed samples: 3596288 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974572E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:12.033924 | finish at 2025-09-10 11:46:23 + [2025-09-09 22:36:17] iteration 3513/ 11920 | consumed samples: 3597312 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993987E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:37.092901 | finish at 2025-09-10 11:45:54 + [2025-09-09 22:36:23] iteration 3514/ 11920 | consumed samples: 3598336 | elapsed time per iteration (ms): 5913.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994210E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:28.460722 | finish at 2025-09-10 12:24:51 + [2025-09-09 22:36:28] iteration 3515/ 11920 | consumed samples: 3599360 | elapsed time per iteration (ms): 5640.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990447E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:05.876149 | finish at 2025-09-10 11:46:34 + [2025-09-09 22:36:34] iteration 3516/ 11920 | consumed samples: 3600384 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983099E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:40.894526 | finish at 2025-09-10 11:46:15 + [2025-09-09 22:36:40] iteration 3517/ 11920 | consumed samples: 3601408 | elapsed time per iteration (ms): 6240.6 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970045E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:33:59.874909 | finish at 2025-09-10 13:10:40 + [2025-09-09 22:36:46] iteration 3518/ 11920 | consumed samples: 3602432 | elapsed time per iteration (ms): 5633.8 | throughput per 
GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992102E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:08:55.522383 | finish at 2025-09-10 11:45:41 + [2025-09-09 22:36:51] iteration 3519/ 11920 | consumed samples: 3603456 | elapsed time per iteration (ms): 5650.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976346E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:09.414353 | finish at 2025-09-10 11:48:01 + [2025-09-09 22:36:57] iteration 3520/ 11920 | consumed samples: 3604480 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980438E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:19.624672 | finish at 2025-09-10 11:46:17 + [2025-09-09 22:37:03] iteration 3521/ 11920 | consumed samples: 3605504 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963472E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:35.044201 | finish at 2025-09-10 11:44:38 + [2025-09-09 22:37:08] iteration 3522/ 11920 | consumed samples: 3606528 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973044E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:43.203348 
| finish at 2025-09-10 11:44:51 + [2025-09-09 22:37:14] iteration 3523/ 11920 | consumed samples: 3607552 | elapsed time per iteration (ms): 5972.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980385E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:52.466603 | finish at 2025-09-10 12:33:07 + [2025-09-09 22:37:20] iteration 3524/ 11920 | consumed samples: 3608576 | elapsed time per iteration (ms): 5952.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987019E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:52:59.188640 | finish at 2025-09-10 12:30:19 + [2025-09-09 22:37:26] iteration 3525/ 11920 | consumed samples: 3609600 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990899E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:47.559785 | finish at 2025-09-10 11:45:13 + [2025-09-09 22:37:32] iteration 3526/ 11920 | consumed samples: 3610624 | elapsed time per iteration (ms): 6049.3 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991448E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:06:17.631702 | finish at 2025-09-10 12:43:49 + [2025-09-09 22:37:38] iteration 3527/ 11920 | consumed samples: 3611648 | elapsed time per iteration (ms): 5868.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.974929E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:53.029777 | finish at 2025-09-10 12:18:31 + [2025-09-09 22:37:43] iteration 3528/ 11920 | consumed samples: 3612672 | elapsed time per iteration (ms): 5649.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992036E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:13.846415 | finish at 2025-09-10 11:47:57 + [2025-09-09 22:37:49] iteration 3529/ 11920 | consumed samples: 3613696 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985820E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:08:29.002251 | finish at 2025-09-10 11:46:18 + [2025-09-09 22:37:55] iteration 3530/ 11920 | consumed samples: 3614720 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979161E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:05.279231 | finish at 2025-09-10 11:45:00 + [2025-09-09 22:38:00] iteration 3531/ 11920 | consumed samples: 3615744 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979570E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:50.592050 | finish at 2025-09-10 11:44:51 + [2025-09-09 22:38:06] iteration 3532/ 11920 | consumed samples: 
3616768 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977981E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:03.932997 | finish at 2025-09-10 11:45:10 + [2025-09-09 22:38:12] iteration 3533/ 11920 | consumed samples: 3617792 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985983E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:09.536903 | finish at 2025-09-10 11:45:21 + [2025-09-09 22:38:17] iteration 3534/ 11920 | consumed samples: 3618816 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975207E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:36.390182 | finish at 2025-09-10 11:44:54 + [2025-09-09 22:38:23] iteration 3535/ 11920 | consumed samples: 3619840 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981411E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:06.456642 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:38:28] iteration 3536/ 11920 | consumed samples: 3620864 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976392E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:54.834839 | finish at 2025-09-10 11:44:23 + [2025-09-09 22:38:34] iteration 3537/ 11920 | consumed samples: 3621888 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975223E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:49.643888 | finish at 2025-09-10 11:46:24 + [2025-09-09 22:38:40] iteration 3538/ 11920 | consumed samples: 3622912 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984225E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:34.658506 | finish at 2025-09-10 11:46:14 + [2025-09-09 22:38:45] iteration 3539/ 11920 | consumed samples: 3623936 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983615E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:03.742284 | finish at 2025-09-10 11:44:49 + [2025-09-09 22:38:51] iteration 3540/ 11920 | consumed samples: 3624960 | elapsed time per iteration (ms): 5841.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977352E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:48.461766 | finish at 2025-09-10 12:14:40 + [2025-09-09 22:38:57] iteration 3541/ 11920 | consumed samples: 3625984 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984620E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:50.287886 | finish at 2025-09-10 11:44:47 + [2025-09-09 22:39:03] iteration 3542/ 11920 | consumed samples: 3627008 | elapsed time per iteration (ms): 5842.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969059E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:45.288782 | finish at 2025-09-10 12:14:48 + [2025-09-09 22:39:09] iteration 3543/ 11920 | consumed samples: 3628032 | elapsed time per iteration (ms): 6070.5 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977935E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:07:32.593615 | finish at 2025-09-10 12:46:41 + [2025-09-09 22:39:14] iteration 3544/ 11920 | consumed samples: 3629056 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994681E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:58.097128 | finish at 2025-09-10 11:45:12 + [2025-09-09 22:39:20] iteration 3545/ 11920 | consumed samples: 3630080 | elapsed time per iteration (ms): 5852.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991366E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:53.157666 | finish at 2025-09-10 
12:16:13 + [2025-09-09 22:39:26] iteration 3546/ 11920 | consumed samples: 3631104 | elapsed time per iteration (ms): 5936.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971212E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:48:33.591270 | finish at 2025-09-10 12:28:00 + [2025-09-09 22:39:32] iteration 3547/ 11920 | consumed samples: 3632128 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972455E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:22.747257 | finish at 2025-09-10 11:45:54 + [2025-09-09 22:39:37] iteration 3548/ 11920 | consumed samples: 3633152 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959430E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:33.126382 | finish at 2025-09-10 11:46:11 + [2025-09-09 22:39:43] iteration 3549/ 11920 | consumed samples: 3634176 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979133E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:00.370406 | finish at 2025-09-10 11:45:43 + [2025-09-09 22:39:49] iteration 3550/ 11920 | consumed samples: 3635200 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975417E+00 | loss 
scale: 1.0 | grad norm: 0.115 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:13.819592 | finish at 2025-09-10 11:45:02 + [2025-09-09 22:39:54] iteration 3551/ 11920 | consumed samples: 3636224 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974980E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:50.832793 | finish at 2025-09-10 11:45:45 + [2025-09-09 22:40:00] iteration 3552/ 11920 | consumed samples: 3637248 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972075E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:20.946297 | finish at 2025-09-10 11:44:21 + [2025-09-09 22:40:06] iteration 3553/ 11920 | consumed samples: 3638272 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987810E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:43.262223 | finish at 2025-09-10 11:44:49 + [2025-09-09 22:40:11] iteration 3554/ 11920 | consumed samples: 3639296 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977814E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:58.648322 | finish at 2025-09-10 11:44:10 + [2025-09-09 22:40:17] iteration 3555/ 11920 | consumed samples: 3640320 | elapsed time 
per iteration (ms): 5981.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963306E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:53:58.395863 | finish at 2025-09-10 12:34:16 + [2025-09-09 22:40:23] iteration 3556/ 11920 | consumed samples: 3641344 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978424E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:19.259929 | finish at 2025-09-10 11:43:42 + [2025-09-09 22:40:28] iteration 3557/ 11920 | consumed samples: 3642368 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977198E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:33.850812 | finish at 2025-09-10 11:44:02 + [2025-09-09 22:40:34] iteration 3558/ 11920 | consumed samples: 3643392 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968024E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:41.918682 | finish at 2025-09-10 11:45:16 + [2025-09-09 22:40:40] iteration 3559/ 11920 | consumed samples: 3644416 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978301E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:05:03.749543 | finish at 2025-09-10 11:45:43 + [2025-09-09 22:40:46] iteration 3560/ 11920 | consumed samples: 3645440 | elapsed time per iteration (ms): 5994.0 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968780E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:55:09.941301 | finish at 2025-09-10 12:35:56 + [2025-09-09 22:40:51] iteration 3561/ 11920 | consumed samples: 3646464 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970894E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:17.798901 | finish at 2025-09-10 11:45:09 + [2025-09-09 22:40:57] iteration 3562/ 11920 | consumed samples: 3647488 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968639E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:14.797678 | finish at 2025-09-10 11:45:12 + [2025-09-09 22:41:03] iteration 3563/ 11920 | consumed samples: 3648512 | elapsed time per iteration (ms): 5875.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993572E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:17.946837 | finish at 2025-09-10 12:19:21 + [2025-09-09 22:41:08] iteration 3564/ 11920 | consumed samples: 3649536 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.970138E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:02:50.138287 | finish at 2025-09-10 11:43:59 + [2025-09-09 22:41:14] iteration 3565/ 11920 | consumed samples: 3650560 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971498E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:33.757092 | finish at 2025-09-10 11:44:48 + [2025-09-09 22:41:20] iteration 3566/ 11920 | consumed samples: 3651584 | elapsed time per iteration (ms): 6131.8 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977927E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:13:44.961064 | finish at 2025-09-10 12:55:05 + [2025-09-09 22:41:26] iteration 3567/ 11920 | consumed samples: 3652608 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986359E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:24.682213 | finish at 2025-09-10 11:46:50 + [2025-09-09 22:41:31] iteration 3568/ 11920 | consumed samples: 3653632 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982526E+00 | loss scale: 1.0 | grad norm: 0.348 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:31.001129 | finish at 2025-09-10 11:46:02 + [2025-09-09 
22:41:37] iteration 3569/ 11920 | consumed samples: 3654656 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988053E+00 | loss scale: 1.0 | grad norm: 0.338 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:47.262824 | finish at 2025-09-10 11:45:24 + [2025-09-09 22:41:43] iteration 3570/ 11920 | consumed samples: 3655680 | elapsed time per iteration (ms): 5916.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989133E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:18.628938 | finish at 2025-09-10 12:25:02 + [2025-09-09 22:41:49] iteration 3571/ 11920 | consumed samples: 3656704 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985650E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:02:58.034270 | finish at 2025-09-10 11:44:47 + [2025-09-09 22:41:54] iteration 3572/ 11920 | consumed samples: 3657728 | elapsed time per iteration (ms): 5640.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968608E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:04:47.523514 | finish at 2025-09-10 11:46:42 + [2025-09-09 22:42:00] iteration 3573/ 11920 | consumed samples: 3658752 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970137E+00 | loss scale: 1.0 | grad norm: 
0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:02:40.635332 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:42:06] iteration 3574/ 11920 | consumed samples: 3659776 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977227E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:26.470562 | finish at 2025-09-10 11:45:32 + [2025-09-09 22:42:11] iteration 3575/ 11920 | consumed samples: 3660800 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975030E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:33.910038 | finish at 2025-09-10 11:45:45 + [2025-09-09 22:42:17] iteration 3576/ 11920 | consumed samples: 3661824 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994948E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:36.277481 | finish at 2025-09-10 11:45:53 +(min, max) time across ranks (ms): + save-checkpoint ................................: (6704.92, 6705.08) + [2025-09-09 22:42:29] iteration 3577/ 11920 | consumed samples: 3662848 | elapsed time per iteration (ms): 5941.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979608E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:46:09.333354 | finish at 2025-09-10 12:28:39 + 
[2025-09-09 22:42:35] iteration 3578/ 11920 | consumed samples: 3663872 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981847E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:43.633934 | finish at 2025-09-10 11:46:19 + [2025-09-09 22:42:41] iteration 3579/ 11920 | consumed samples: 3664896 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967218E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:40.181417 | finish at 2025-09-10 11:44:21 + [2025-09-09 22:42:47] iteration 3580/ 11920 | consumed samples: 3665920 | elapsed time per iteration (ms): 5875.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986848E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:36:41.228271 | finish at 2025-09-10 12:19:28 + [2025-09-09 22:42:52] iteration 3581/ 11920 | consumed samples: 3666944 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962256E+00 | loss scale: 1.0 | grad norm: 0.108 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:21.619244 | finish at 2025-09-10 11:44:14 + [2025-09-09 22:42:58] iteration 3582/ 11920 | consumed samples: 3667968 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959381E+00 | loss scale: 1.0 | 
grad norm: 0.093 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:43.904065 | finish at 2025-09-10 11:43:42 + [2025-09-09 22:43:04] iteration 3583/ 11920 | consumed samples: 3668992 | elapsed time per iteration (ms): 5863.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975451E+00 | loss scale: 1.0 | grad norm: 0.088 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:40.541895 | finish at 2025-09-10 12:17:44 + [2025-09-09 22:43:09] iteration 3584/ 11920 | consumed samples: 3670016 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968458E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 13.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:43.982414 | finish at 2025-09-10 11:43:53 + [2025-09-09 22:43:15] iteration 3585/ 11920 | consumed samples: 3671040 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977018E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:35.491403 | finish at 2025-09-10 11:43:50 + [2025-09-09 22:43:21] iteration 3586/ 11920 | consumed samples: 3672064 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973346E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:11.228989 | finish at 2025-09-10 11:46:32 + [2025-09-09 22:43:26] iteration 3587/ 11920 | consumed samples: 3673088 | elapsed time per iteration 
(ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953929E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:33.528841 | finish at 2025-09-10 11:45:00 + [2025-09-09 22:43:32] iteration 3588/ 11920 | consumed samples: 3674112 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970476E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:13.448301 | finish at 2025-09-10 11:46:45 + [2025-09-09 22:43:37] iteration 3589/ 11920 | consumed samples: 3675136 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978474E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:31.297546 | finish at 2025-09-10 11:45:09 + [2025-09-09 22:43:43] iteration 3590/ 11920 | consumed samples: 3676160 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984178E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:58.188362 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:43:49] iteration 3591/ 11920 | consumed samples: 3677184 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966522E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 13:02:24.348253 | finish at 2025-09-10 11:46:13 + [2025-09-09 22:43:54] iteration 3592/ 11920 | consumed samples: 3678208 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967002E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:33.385866 | finish at 2025-09-10 11:45:28 + [2025-09-09 22:44:00] iteration 3593/ 11920 | consumed samples: 3679232 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983082E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:33.286134 | finish at 2025-09-10 11:45:33 + [2025-09-09 22:44:06] iteration 3594/ 11920 | consumed samples: 3680256 | elapsed time per iteration (ms): 5854.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963530E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:25.260251 | finish at 2025-09-10 12:16:31 + [2025-09-09 22:44:11] iteration 3595/ 11920 | consumed samples: 3681280 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968757E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:29.500490 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:44:17] iteration 3596/ 11920 | consumed samples: 3682304 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.976100E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:24.607641 | finish at 2025-09-10 11:44:42 + [2025-09-09 22:44:23] iteration 3597/ 11920 | consumed samples: 3683328 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978154E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:02:04.238681 | finish at 2025-09-10 11:46:27 + [2025-09-09 22:44:28] iteration 3598/ 11920 | consumed samples: 3684352 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969356E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:12.989844 | finish at 2025-09-10 11:45:41 + [2025-09-09 22:44:34] iteration 3599/ 11920 | consumed samples: 3685376 | elapsed time per iteration (ms): 5651.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968944E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:46.016324 | finish at 2025-09-10 11:48:20 + [2025-09-09 22:44:40] iteration 3600/ 11920 | consumed samples: 3686400 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955361E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:01:57.687531 | finish at 2025-09-10 11:46:37 + [2025-09-09 22:44:45] iteration 
3601/ 11920 | consumed samples: 3687424 | elapsed time per iteration (ms): 5862.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981940E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:52.659993 | finish at 2025-09-10 12:17:38 + [2025-09-09 22:44:51] iteration 3602/ 11920 | consumed samples: 3688448 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987768E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:54.232150 | finish at 2025-09-10 11:45:45 + [2025-09-09 22:44:57] iteration 3603/ 11920 | consumed samples: 3689472 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965614E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:28.802316 | finish at 2025-09-10 11:44:26 + [2025-09-09 22:45:02] iteration 3604/ 11920 | consumed samples: 3690496 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970141E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:19.515161 | finish at 2025-09-10 11:45:22 + [2025-09-09 22:45:08] iteration 3605/ 11920 | consumed samples: 3691520 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979249E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:32.868208 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:45:14] iteration 3606/ 11920 | consumed samples: 3692544 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959017E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:44.718693 | finish at 2025-09-10 11:43:58 + [2025-09-09 22:45:19] iteration 3607/ 11920 | consumed samples: 3693568 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970387E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:03.780206 | finish at 2025-09-10 11:44:23 + [2025-09-09 22:45:25] iteration 3608/ 11920 | consumed samples: 3694592 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986465E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:44.997526 | finish at 2025-09-10 11:45:10 + [2025-09-09 22:45:31] iteration 3609/ 11920 | consumed samples: 3695616 | elapsed time per iteration (ms): 5863.3 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981947E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:32:09.504578 | finish at 2025-09-10 12:17:40 + [2025-09-09 22:45:37] iteration 3610/ 11920 | consumed samples: 3696640 | elapsed time per iteration (ms): 6207.3 | throughput per 
GPU (TFLOP/s/GPU): 72.7 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971677E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:19:42.975605 | finish at 2025-09-10 13:05:20 + [2025-09-09 22:45:43] iteration 3611/ 11920 | consumed samples: 3697664 | elapsed time per iteration (ms): 5935.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978337E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:41:56.337312 | finish at 2025-09-10 12:27:39 + [2025-09-09 22:45:49] iteration 3612/ 11920 | consumed samples: 3698688 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964583E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:12.975193 | finish at 2025-09-10 11:46:01 + [2025-09-09 22:45:54] iteration 3613/ 11920 | consumed samples: 3699712 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968640E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:42.431216 | finish at 2025-09-10 11:45:37 + [2025-09-09 22:46:00] iteration 3614/ 11920 | consumed samples: 3700736 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972834E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:48.208063 
| finish at 2025-09-10 11:45:48 + [2025-09-09 22:46:05] iteration 3615/ 11920 | consumed samples: 3701760 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969591E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:56.088985 | finish at 2025-09-10 11:45:02 + [2025-09-09 22:46:11] iteration 3616/ 11920 | consumed samples: 3702784 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971031E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:10.612873 | finish at 2025-09-10 11:46:22 + [2025-09-09 22:46:17] iteration 3617/ 11920 | consumed samples: 3703808 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970293E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:28.057036 | finish at 2025-09-10 11:44:45 + [2025-09-09 22:46:23] iteration 3618/ 11920 | consumed samples: 3704832 | elapsed time per iteration (ms): 5854.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958506E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:59.776144 | finish at 2025-09-10 12:16:22 + [2025-09-09 22:46:28] iteration 3619/ 11920 | consumed samples: 3705856 | elapsed time per iteration (ms): 5890.5 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.966047E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:57.357728 | finish at 2025-09-10 12:21:26 + [2025-09-09 22:46:34] iteration 3620/ 11920 | consumed samples: 3706880 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980329E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:17.707033 | finish at 2025-09-10 11:44:52 + [2025-09-09 22:46:40] iteration 3621/ 11920 | consumed samples: 3707904 | elapsed time per iteration (ms): 5930.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971139E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:18.921380 | finish at 2025-09-10 12:26:59 + [2025-09-09 22:46:46] iteration 3622/ 11920 | consumed samples: 3708928 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963441E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:47.256206 | finish at 2025-09-10 11:44:33 + [2025-09-09 22:46:51] iteration 3623/ 11920 | consumed samples: 3709952 | elapsed time per iteration (ms): 5870.5 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960788E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:31:47.218766 | finish at 2025-09-10 12:18:39 + [2025-09-09 22:46:57] iteration 3624/ 11920 | consumed samples: 
3710976 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979464E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:27.642427 | finish at 2025-09-10 11:47:25 + [2025-09-09 22:47:03] iteration 3625/ 11920 | consumed samples: 3712000 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968003E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:45.059756 | finish at 2025-09-10 11:45:48 + [2025-09-09 22:47:08] iteration 3626/ 11920 | consumed samples: 3713024 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964182E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:41.383336 | finish at 2025-09-10 11:43:50 + [2025-09-09 22:47:15] iteration 3627/ 11920 | consumed samples: 3714048 | elapsed time per iteration (ms): 6459.8 | throughput per GPU (TFLOP/s/GPU): 69.9 | MFU 7.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969748E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:52:51.070034 | finish at 2025-09-10 13:40:06 + [2025-09-09 22:47:20] iteration 3628/ 11920 | consumed samples: 3715072 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969819E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:16.441541 | finish at 2025-09-10 11:45:37 + [2025-09-09 22:47:26] iteration 3629/ 11920 | consumed samples: 3716096 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949416E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:06.459980 | finish at 2025-09-10 11:43:33 + [2025-09-09 22:47:32] iteration 3630/ 11920 | consumed samples: 3717120 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948912E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:08.164699 | finish at 2025-09-10 11:44:40 + [2025-09-09 22:47:37] iteration 3631/ 11920 | consumed samples: 3718144 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959950E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:48.152960 | finish at 2025-09-10 11:44:25 + [2025-09-09 22:47:43] iteration 3632/ 11920 | consumed samples: 3719168 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986776E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:57.143021 | finish at 2025-09-10 11:43:40 + [2025-09-09 22:47:49] iteration 3633/ 11920 | consumed samples: 3720192 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962379E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:38.966261 | finish at 2025-09-10 11:45:28 + [2025-09-09 22:47:55] iteration 3634/ 11920 | consumed samples: 3721216 | elapsed time per iteration (ms): 6010.9 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971190E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:50:06.011363 | finish at 2025-09-10 12:38:01 + [2025-09-09 22:48:00] iteration 3635/ 11920 | consumed samples: 3722240 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969985E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:32.324071 | finish at 2025-09-10 11:44:33 + [2025-09-09 22:48:06] iteration 3636/ 11920 | consumed samples: 3723264 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963430E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:41.997211 | finish at 2025-09-10 11:44:48 + [2025-09-09 22:48:11] iteration 3637/ 11920 | consumed samples: 3724288 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969565E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:21.092480 | finish at 2025-09-10 
11:46:33 + [2025-09-09 22:48:17] iteration 3638/ 11920 | consumed samples: 3725312 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971076E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:08.399773 | finish at 2025-09-10 11:44:25 + [2025-09-09 22:48:23] iteration 3639/ 11920 | consumed samples: 3726336 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956517E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:18.450925 | finish at 2025-09-10 11:43:41 + [2025-09-09 22:48:28] iteration 3640/ 11920 | consumed samples: 3727360 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966288E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:19.879990 | finish at 2025-09-10 11:44:48 + [2025-09-09 22:48:34] iteration 3641/ 11920 | consumed samples: 3728384 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962023E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:35.753764 | finish at 2025-09-10 11:45:10 + [2025-09-09 22:48:40] iteration 3642/ 11920 | consumed samples: 3729408 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974788E+00 | loss 
scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:35.929726 | finish at 2025-09-10 11:44:16 + [2025-09-09 22:48:46] iteration 3643/ 11920 | consumed samples: 3730432 | elapsed time per iteration (ms): 5982.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963665E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:45:15.106725 | finish at 2025-09-10 12:34:01 + [2025-09-09 22:48:51] iteration 3644/ 11920 | consumed samples: 3731456 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964273E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:14.948941 | finish at 2025-09-10 11:44:06 + [2025-09-09 22:48:57] iteration 3645/ 11920 | consumed samples: 3732480 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950113E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:05.564159 | finish at 2025-09-10 11:44:02 + [2025-09-09 22:49:02] iteration 3646/ 11920 | consumed samples: 3733504 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969118E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:04.272906 | finish at 2025-09-10 11:47:07 + [2025-09-09 22:49:08] iteration 3647/ 11920 | consumed samples: 3734528 | elapsed time 
per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959782E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:40.924496 | finish at 2025-09-10 11:45:49 + [2025-09-09 22:49:14] iteration 3648/ 11920 | consumed samples: 3735552 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967454E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:05.765625 | finish at 2025-09-10 11:44:19 + [2025-09-09 22:49:19] iteration 3649/ 11920 | consumed samples: 3736576 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954428E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:48.788695 | finish at 2025-09-10 11:46:08 + [2025-09-09 22:49:25] iteration 3650/ 11920 | consumed samples: 3737600 | elapsed time per iteration (ms): 5983.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960134E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:44:39.946640 | finish at 2025-09-10 12:34:05 + [2025-09-09 22:49:31] iteration 3651/ 11920 | consumed samples: 3738624 | elapsed time per iteration (ms): 5649.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964847E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 12:58:35.223717 | finish at 2025-09-10 11:48:06 + [2025-09-09 22:49:37] iteration 3652/ 11920 | consumed samples: 3739648 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977028E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:45.920311 | finish at 2025-09-10 11:46:23 + [2025-09-09 22:49:42] iteration 3653/ 11920 | consumed samples: 3740672 | elapsed time per iteration (ms): 5846.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971658E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:29.849401 | finish at 2025-09-10 12:15:12 + [2025-09-09 22:49:48] iteration 3654/ 11920 | consumed samples: 3741696 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965714E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:56.983126 | finish at 2025-09-10 11:44:45 + [2025-09-09 22:49:54] iteration 3655/ 11920 | consumed samples: 3742720 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980060E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:39.067183 | finish at 2025-09-10 11:46:33 + [2025-09-09 22:50:00] iteration 3656/ 11920 | consumed samples: 3743744 | elapsed time per iteration (ms): 5979.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.959855E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:31.168032 | finish at 2025-09-10 12:33:31 + [2025-09-09 22:50:05] iteration 3657/ 11920 | consumed samples: 3744768 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964480E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:21.317422 | finish at 2025-09-10 11:45:27 + [2025-09-09 22:50:11] iteration 3658/ 11920 | consumed samples: 3745792 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969525E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:04.061502 | finish at 2025-09-10 11:45:15 + [2025-09-09 22:50:17] iteration 3659/ 11920 | consumed samples: 3746816 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962808E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:32.907940 | finish at 2025-09-10 11:43:49 + [2025-09-09 22:50:22] iteration 3660/ 11920 | consumed samples: 3747840 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967888E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:52.268505 | finish at 2025-09-10 11:45:14 + [2025-09-09 
22:50:28] iteration 3661/ 11920 | consumed samples: 3748864 | elapsed time per iteration (ms): 5977.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966655E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:49.993602 | finish at 2025-09-10 12:33:18 + [2025-09-09 22:50:34] iteration 3662/ 11920 | consumed samples: 3749888 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968352E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:32.847460 | finish at 2025-09-10 11:46:07 + [2025-09-09 22:50:39] iteration 3663/ 11920 | consumed samples: 3750912 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961927E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:15.086198 | finish at 2025-09-10 11:44:55 + [2025-09-09 22:50:45] iteration 3664/ 11920 | consumed samples: 3751936 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970667E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:55.354889 | finish at 2025-09-10 11:45:40 + [2025-09-09 22:50:51] iteration 3665/ 11920 | consumed samples: 3752960 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961311E+00 | loss scale: 1.0 | grad norm: 
0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:29.584241 | finish at 2025-09-10 11:44:20 + [2025-09-09 22:50:56] iteration 3666/ 11920 | consumed samples: 3753984 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960856E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:12.046082 | finish at 2025-09-10 11:45:08 + [2025-09-09 22:51:02] iteration 3667/ 11920 | consumed samples: 3755008 | elapsed time per iteration (ms): 5842.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959682E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:38.219165 | finish at 2025-09-10 12:14:40 + [2025-09-09 22:51:08] iteration 3668/ 11920 | consumed samples: 3756032 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964349E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:56.604998 | finish at 2025-09-10 11:44:04 + [2025-09-09 22:51:13] iteration 3669/ 11920 | consumed samples: 3757056 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967896E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:23.618696 | finish at 2025-09-10 11:44:37 + [2025-09-09 22:51:19] iteration 3670/ 11920 | consumed samples: 3758080 | elapsed time per iteration (ms): 
5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963047E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:38.466747 | finish at 2025-09-10 11:44:58 + [2025-09-09 22:51:25] iteration 3671/ 11920 | consumed samples: 3759104 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951695E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:32.665228 | finish at 2025-09-10 11:44:57 + [2025-09-09 22:51:30] iteration 3672/ 11920 | consumed samples: 3760128 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960448E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:44.247404 | finish at 2025-09-10 11:45:15 + [2025-09-09 22:51:36] iteration 3673/ 11920 | consumed samples: 3761152 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954780E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:52.473218 | finish at 2025-09-10 11:44:28 + [2025-09-09 22:51:42] iteration 3674/ 11920 | consumed samples: 3762176 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970669E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:52:41.461462 | finish at 2025-09-10 11:44:23 + [2025-09-09 22:51:47] iteration 3675/ 11920 | consumed samples: 3763200 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963119E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:28.920615 | finish at 2025-09-10 11:45:16 + [2025-09-09 22:51:53] iteration 3676/ 11920 | consumed samples: 3764224 | elapsed time per iteration (ms): 5989.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956012E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:00.278558 | finish at 2025-09-10 12:34:53 + [2025-09-09 22:51:59] iteration 3677/ 11920 | consumed samples: 3765248 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972590E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:39.414781 | finish at 2025-09-10 11:44:38 + [2025-09-09 22:52:04] iteration 3678/ 11920 | consumed samples: 3766272 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973583E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:41.743227 | finish at 2025-09-10 11:44:46 + [2025-09-09 22:52:10] iteration 3679/ 11920 | consumed samples: 3767296 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.952943E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:18.687013 | finish at 2025-09-10 11:46:29 + [2025-09-09 22:52:16] iteration 3680/ 11920 | consumed samples: 3768320 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971813E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:08.362141 | finish at 2025-09-10 11:45:24 + [2025-09-09 22:52:21] iteration 3681/ 11920 | consumed samples: 3769344 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961971E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:51.899201 | finish at 2025-09-10 11:45:13 + [2025-09-09 22:52:27] iteration 3682/ 11920 | consumed samples: 3770368 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957586E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:10.536166 | finish at 2025-09-10 11:44:37 + [2025-09-09 22:52:33] iteration 3683/ 11920 | consumed samples: 3771392 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974838E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:06.650174 | finish at 2025-09-10 11:44:39 + [2025-09-09 22:52:38] iteration 
3684/ 11920 | consumed samples: 3772416 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962720E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:46.489315 | finish at 2025-09-10 11:44:25 + [2025-09-09 22:52:44] iteration 3685/ 11920 | consumed samples: 3773440 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961222E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:59.960707 | finish at 2025-09-10 11:44:44 + [2025-09-09 22:52:49] iteration 3686/ 11920 | consumed samples: 3774464 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950207E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:23.735903 | finish at 2025-09-10 11:45:13 + [2025-09-09 22:52:55] iteration 3687/ 11920 | consumed samples: 3775488 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947428E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:47.044639 | finish at 2025-09-10 11:45:42 + [2025-09-09 22:53:01] iteration 3688/ 11920 | consumed samples: 3776512 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964762E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:38.998180 | finish at 2025-09-10 11:44:40 + [2025-09-09 22:53:06] iteration 3689/ 11920 | consumed samples: 3777536 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963388E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:44.422353 | finish at 2025-09-10 11:44:51 + [2025-09-09 22:53:12] iteration 3690/ 11920 | consumed samples: 3778560 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963648E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:53:23.071170 | finish at 2025-09-10 11:46:35 + [2025-09-09 22:53:18] iteration 3691/ 11920 | consumed samples: 3779584 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938776E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:41.317376 | finish at 2025-09-10 11:45:59 + [2025-09-09 22:53:23] iteration 3692/ 11920 | consumed samples: 3780608 | elapsed time per iteration (ms): 5892.1 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950778E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:59.928872 | finish at 2025-09-10 12:21:23 + [2025-09-09 22:53:29] iteration 3693/ 11920 | consumed samples: 3781632 | elapsed time per iteration (ms): 5620.3 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961545E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:38.563575 | finish at 2025-09-10 11:44:08 + [2025-09-09 22:53:35] iteration 3694/ 11920 | consumed samples: 3782656 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962429E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:10.330183 | finish at 2025-09-10 11:44:45 + [2025-09-09 22:53:41] iteration 3695/ 11920 | consumed samples: 3783680 | elapsed time per iteration (ms): 5912.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959532E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:26.015180 | finish at 2025-09-10 12:24:07 + [2025-09-09 22:53:46] iteration 3696/ 11920 | consumed samples: 3784704 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961277E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:16.669418 | finish at 2025-09-10 11:46:03 + [2025-09-09 22:53:52] iteration 3697/ 11920 | consumed samples: 3785728 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957380E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:07.765324 
| finish at 2025-09-10 11:45:00 + [2025-09-09 22:53:58] iteration 3698/ 11920 | consumed samples: 3786752 | elapsed time per iteration (ms): 5984.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956925E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:40:02.447217 | finish at 2025-09-10 12:34:00 + [2025-09-09 22:54:04] iteration 3699/ 11920 | consumed samples: 3787776 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961006E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:30.116220 | finish at 2025-09-10 11:44:34 + [2025-09-09 22:54:09] iteration 3700/ 11920 | consumed samples: 3788800 | elapsed time per iteration (ms): 5852.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966099E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:21:46.507072 | finish at 2025-09-10 12:15:56 + [2025-09-09 22:54:15] iteration 3701/ 11920 | consumed samples: 3789824 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966033E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:59.540262 | finish at 2025-09-10 11:44:15 + [2025-09-09 22:54:21] iteration 3702/ 11920 | consumed samples: 3790848 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.960509E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:35.072835 | finish at 2025-09-10 11:44:56 + [2025-09-09 22:54:26] iteration 3703/ 11920 | consumed samples: 3791872 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960373E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:50.915469 | finish at 2025-09-10 11:44:17 + [2025-09-09 22:54:32] iteration 3704/ 11920 | consumed samples: 3792896 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952866E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:58.103914 | finish at 2025-09-10 11:46:30 + [2025-09-09 22:54:37] iteration 3705/ 11920 | consumed samples: 3793920 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956112E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:21.103148 | finish at 2025-09-10 11:44:59 + [2025-09-09 22:54:43] iteration 3706/ 11920 | consumed samples: 3794944 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971810E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:52:16.261162 | finish at 2025-09-10 11:46:59 + [2025-09-09 22:54:49] iteration 3707/ 11920 | consumed samples: 
3795968 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950637E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:14.606595 | finish at 2025-09-10 11:45:03 + [2025-09-09 22:54:54] iteration 3708/ 11920 | consumed samples: 3796992 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965297E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:28.920732 | finish at 2025-09-10 11:45:23 + [2025-09-09 22:55:00] iteration 3709/ 11920 | consumed samples: 3798016 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952340E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:07.473279 | finish at 2025-09-10 11:44:07 + [2025-09-09 22:55:06] iteration 3710/ 11920 | consumed samples: 3799040 | elapsed time per iteration (ms): 6014.4 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965939E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:58.496125 | finish at 2025-09-10 12:38:05 + [2025-09-09 22:55:12] iteration 3711/ 11920 | consumed samples: 3800064 | elapsed time per iteration (ms): 5982.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965535E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 8.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:38:30.049550 | finish at 2025-09-10 12:33:42 + [2025-09-09 22:55:18] iteration 3712/ 11920 | consumed samples: 3801088 | elapsed time per iteration (ms): 6455.4 | throughput per GPU (TFLOP/s/GPU): 69.9 | MFU 7.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967526E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:43:06.267540 | finish at 2025-09-10 13:38:25 + [2025-09-09 22:55:24] iteration 3713/ 11920 | consumed samples: 3802112 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953304E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:32.270578 | finish at 2025-09-10 11:45:56 + [2025-09-09 22:55:30] iteration 3714/ 11920 | consumed samples: 3803136 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949366E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:46.013310 | finish at 2025-09-10 11:45:16 + [2025-09-09 22:55:35] iteration 3715/ 11920 | consumed samples: 3804160 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964011E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:26.311909 | finish at 2025-09-10 11:45:02 + [2025-09-09 22:55:41] iteration 3716/ 11920 | consumed samples: 3805184 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960775E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:42.938684 | finish at 2025-09-10 11:44:24 + [2025-09-09 22:55:47] iteration 3717/ 11920 | consumed samples: 3806208 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953854E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:09.154673 | finish at 2025-09-10 11:45:56 + [2025-09-09 22:55:52] iteration 3718/ 11920 | consumed samples: 3807232 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966127E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:21.961035 | finish at 2025-09-10 11:45:14 + [2025-09-09 22:55:58] iteration 3719/ 11920 | consumed samples: 3808256 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969185E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:29.287963 | finish at 2025-09-10 11:46:27 + [2025-09-09 22:56:04] iteration 3720/ 11920 | consumed samples: 3809280 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958295E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:41.115294 | finish at 2025-09-10 
11:45:45 + [2025-09-09 22:56:09] iteration 3721/ 11920 | consumed samples: 3810304 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962676E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:59.161424 | finish at 2025-09-10 11:45:08 + [2025-09-09 22:56:15] iteration 3722/ 11920 | consumed samples: 3811328 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963921E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:15.053731 | finish at 2025-09-10 11:43:30 + [2025-09-09 22:56:20] iteration 3723/ 11920 | consumed samples: 3812352 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957345E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:23.250933 | finish at 2025-09-10 11:44:44 + [2025-09-09 22:56:26] iteration 3724/ 11920 | consumed samples: 3813376 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960987E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:47.512207 | finish at 2025-09-10 11:45:14 + [2025-09-09 22:56:32] iteration 3725/ 11920 | consumed samples: 3814400 | elapsed time per iteration (ms): 5926.0 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958555E+00 | loss 
scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:29:23.877722 | finish at 2025-09-10 12:25:56 + [2025-09-09 22:56:38] iteration 3726/ 11920 | consumed samples: 3815424 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945807E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:20.482722 | finish at 2025-09-10 11:44:58 + [2025-09-09 22:56:43] iteration 3727/ 11920 | consumed samples: 3816448 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960676E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:43.623779 | finish at 2025-09-10 11:45:27 + [2025-09-09 22:56:49] iteration 3728/ 11920 | consumed samples: 3817472 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954516E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:12.013672 | finish at 2025-09-10 11:45:01 + [2025-09-09 22:56:54] iteration 3729/ 11920 | consumed samples: 3818496 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955198E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:52.619355 | finish at 2025-09-10 11:44:47 + [2025-09-09 22:57:00] iteration 3730/ 11920 | consumed samples: 3819520 | elapsed time 
per iteration (ms): 6004.0 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969400E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:39:33.095434 | finish at 2025-09-10 12:36:34 + [2025-09-09 22:57:06] iteration 3731/ 11920 | consumed samples: 3820544 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956800E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:24.131958 | finish at 2025-09-10 11:44:30 + [2025-09-09 22:57:12] iteration 3732/ 11920 | consumed samples: 3821568 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976268E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:46.036845 | finish at 2025-09-10 11:44:58 + [2025-09-09 22:57:18] iteration 3733/ 11920 | consumed samples: 3822592 | elapsed time per iteration (ms): 6019.1 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955436E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:41:18.470846 | finish at 2025-09-10 12:38:36 + [2025-09-09 22:57:23] iteration 3734/ 11920 | consumed samples: 3823616 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958657E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 12:46:36.922873 | finish at 2025-09-10 11:44:00 + [2025-09-09 22:57:29] iteration 3735/ 11920 | consumed samples: 3824640 | elapsed time per iteration (ms): 5980.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970235E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:45.993778 | finish at 2025-09-10 12:33:15 + [2025-09-09 22:57:35] iteration 3736/ 11920 | consumed samples: 3825664 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955896E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:17.606827 | finish at 2025-09-10 11:44:53 + [2025-09-09 22:57:41] iteration 3737/ 11920 | consumed samples: 3826688 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950882E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:20.442905 | finish at 2025-09-10 11:45:01 + [2025-09-09 22:57:46] iteration 3738/ 11920 | consumed samples: 3827712 | elapsed time per iteration (ms): 5640.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951565E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:14.106304 | finish at 2025-09-10 11:47:00 + [2025-09-09 22:57:52] iteration 3739/ 11920 | consumed samples: 3828736 | elapsed time per iteration (ms): 5970.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.949500E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:34:01.998002 | finish at 2025-09-10 12:31:54 + [2025-09-09 22:57:58] iteration 3740/ 11920 | consumed samples: 3829760 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949221E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:43.491602 | finish at 2025-09-10 11:45:41 + [2025-09-09 22:58:03] iteration 3741/ 11920 | consumed samples: 3830784 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947159E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:25.136115 | finish at 2025-09-10 11:44:29 + [2025-09-09 22:58:09] iteration 3742/ 11920 | consumed samples: 3831808 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955560E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:14.395582 | finish at 2025-09-10 11:44:23 + [2025-09-09 22:58:15] iteration 3743/ 11920 | consumed samples: 3832832 | elapsed time per iteration (ms): 5831.4 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947596E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:43.682701 | finish at 2025-09-10 12:12:59 + [2025-09-09 
22:58:21] iteration 3744/ 11920 | consumed samples: 3833856 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942551E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:10.532009 | finish at 2025-09-10 11:45:31 + [2025-09-09 22:58:26] iteration 3745/ 11920 | consumed samples: 3834880 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946734E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:45.572877 | finish at 2025-09-10 11:44:12 + [2025-09-09 22:58:32] iteration 3746/ 11920 | consumed samples: 3835904 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943414E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:29.909020 | finish at 2025-09-10 11:45:02 + [2025-09-09 22:58:37] iteration 3747/ 11920 | consumed samples: 3836928 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950122E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:12.731384 | finish at 2025-09-10 11:44:50 + [2025-09-09 22:58:43] iteration 3748/ 11920 | consumed samples: 3837952 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964033E+00 | loss scale: 1.0 | grad norm: 
0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:53.212343 | finish at 2025-09-10 11:45:36 + [2025-09-09 22:58:49] iteration 3749/ 11920 | consumed samples: 3838976 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961568E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:42.337415 | finish at 2025-09-10 11:45:31 + [2025-09-09 22:58:54] iteration 3750/ 11920 | consumed samples: 3840000 | elapsed time per iteration (ms): 5833.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960228E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:17.931421 | finish at 2025-09-10 12:13:12 + [2025-09-09 22:59:00] iteration 3751/ 11920 | consumed samples: 3841024 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963784E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:58.212051 | finish at 2025-09-10 11:45:58 + [2025-09-09 22:59:06] iteration 3752/ 11920 | consumed samples: 3842048 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976803E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:35.356461 | finish at 2025-09-10 11:44:41 + [2025-09-09 22:59:11] iteration 3753/ 11920 | consumed samples: 3843072 | elapsed time per iteration (ms): 
5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954873E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:34.849862 | finish at 2025-09-10 11:43:46 + [2025-09-09 22:59:17] iteration 3754/ 11920 | consumed samples: 3844096 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957844E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:39.104047 | finish at 2025-09-10 11:44:56 + [2025-09-09 22:59:23] iteration 3755/ 11920 | consumed samples: 3845120 | elapsed time per iteration (ms): 5957.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939341E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:30:43.143079 | finish at 2025-09-10 12:30:06 + [2025-09-09 22:59:29] iteration 3756/ 11920 | consumed samples: 3846144 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967362E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:33.379406 | finish at 2025-09-10 11:44:02 + [2025-09-09 22:59:34] iteration 3757/ 11920 | consumed samples: 3847168 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979667E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:44:57.144314 | finish at 2025-09-10 11:44:31 + [2025-09-09 22:59:40] iteration 3758/ 11920 | consumed samples: 3848192 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949321E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:48.069913 | finish at 2025-09-10 11:46:28 + [2025-09-09 22:59:45] iteration 3759/ 11920 | consumed samples: 3849216 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964103E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:55.550589 | finish at 2025-09-10 11:45:41 + [2025-09-09 22:59:51] iteration 3760/ 11920 | consumed samples: 3850240 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970505E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:45.945740 | finish at 2025-09-10 11:44:37 + [2025-09-09 22:59:57] iteration 3761/ 11920 | consumed samples: 3851264 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971372E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:14.848345 | finish at 2025-09-10 11:46:12 + [2025-09-09 23:00:02] iteration 3762/ 11920 | consumed samples: 3852288 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.963487E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:02.483716 | finish at 2025-09-10 11:46:05 + [2025-09-09 23:00:08] iteration 3763/ 11920 | consumed samples: 3853312 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969103E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:14.628860 | finish at 2025-09-10 11:46:23 + [2025-09-09 23:00:14] iteration 3764/ 11920 | consumed samples: 3854336 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949484E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:41.230417 | finish at 2025-09-10 11:45:55 + [2025-09-09 23:00:19] iteration 3765/ 11920 | consumed samples: 3855360 | elapsed time per iteration (ms): 5817.7 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970598E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:43.066669 | finish at 2025-09-10 12:11:02 + [2025-09-09 23:00:25] iteration 3766/ 11920 | consumed samples: 3856384 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972821E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:03.728006 | finish at 2025-09-10 11:44:29 + [2025-09-09 23:00:31] iteration 
3767/ 11920 | consumed samples: 3857408 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961180E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:48.806599 | finish at 2025-09-10 11:45:19 + [2025-09-09 23:00:36] iteration 3768/ 11920 | consumed samples: 3858432 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956152E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:04.413788 | finish at 2025-09-10 11:45:41 + [2025-09-09 23:00:43] iteration 3769/ 11920 | consumed samples: 3859456 | elapsed time per iteration (ms): 6313.5 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955573E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:17:41.025030 | finish at 2025-09-10 13:18:24 + [2025-09-09 23:00:49] iteration 3770/ 11920 | consumed samples: 3860480 | elapsed time per iteration (ms): 5882.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962066E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:19:01.956687 | finish at 2025-09-10 12:19:50 + [2025-09-09 23:00:54] iteration 3771/ 11920 | consumed samples: 3861504 | elapsed time per iteration (ms): 5851.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959619E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:40.427123 | finish at 2025-09-10 12:15:35 + [2025-09-09 23:01:01] iteration 3772/ 11920 | consumed samples: 3862528 | elapsed time per iteration (ms): 6247.7 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962686E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:08:26.337110 | finish at 2025-09-10 13:09:27 + [2025-09-09 23:01:06] iteration 3773/ 11920 | consumed samples: 3863552 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961662E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:23.553267 | finish at 2025-09-10 11:45:30 + [2025-09-09 23:01:12] iteration 3774/ 11920 | consumed samples: 3864576 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944708E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:40.346897 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:01:17] iteration 3775/ 11920 | consumed samples: 3865600 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951252E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:55.215082 | finish at 2025-09-10 11:45:13 + [2025-09-09 23:01:23] iteration 3776/ 11920 | consumed samples: 3866624 | elapsed time per iteration (ms): 5632.3 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950235E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:29.663971 | finish at 2025-09-10 11:45:53 + [2025-09-09 23:01:29] iteration 3777/ 11920 | consumed samples: 3867648 | elapsed time per iteration (ms): 5845.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960585E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:22.302178 | finish at 2025-09-10 12:14:51 + [2025-09-09 23:01:35] iteration 3778/ 11920 | consumed samples: 3868672 | elapsed time per iteration (ms): 5877.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962148E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:34.822881 | finish at 2025-09-10 12:19:10 + [2025-09-09 23:01:40] iteration 3779/ 11920 | consumed samples: 3869696 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956955E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:12.851324 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:01:46] iteration 3780/ 11920 | consumed samples: 3870720 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945909E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:41.431198 
| finish at 2025-09-10 11:46:28 + [2025-09-09 23:01:52] iteration 3781/ 11920 | consumed samples: 3871744 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954101E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:27.262329 | finish at 2025-09-10 11:46:19 + [2025-09-09 23:01:57] iteration 3782/ 11920 | consumed samples: 3872768 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964637E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:50.898789 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:02:03] iteration 3783/ 11920 | consumed samples: 3873792 | elapsed time per iteration (ms): 5928.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954057E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:00.644371 | finish at 2025-09-10 12:26:04 + [2025-09-09 23:02:09] iteration 3784/ 11920 | consumed samples: 3874816 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946103E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:52.865782 | finish at 2025-09-10 11:45:02 + [2025-09-09 23:02:15] iteration 3785/ 11920 | consumed samples: 3875840 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.945029E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:43.740894 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:02:20] iteration 3786/ 11920 | consumed samples: 3876864 | elapsed time per iteration (ms): 5837.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956163E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:25.877471 | finish at 2025-09-10 12:13:46 + [2025-09-09 23:02:26] iteration 3787/ 11920 | consumed samples: 3877888 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946835E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:09.271536 | finish at 2025-09-10 11:44:35 + [2025-09-09 23:02:32] iteration 3788/ 11920 | consumed samples: 3878912 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953686E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:52.566560 | finish at 2025-09-10 11:44:24 + [2025-09-09 23:02:38] iteration 3789/ 11920 | consumed samples: 3879936 | elapsed time per iteration (ms): 5921.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940490E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:22:29.579213 | finish at 2025-09-10 12:25:07 + [2025-09-09 23:02:43] iteration 3790/ 11920 | consumed samples: 
3880960 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953774E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:14.335842 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:02:49] iteration 3791/ 11920 | consumed samples: 3881984 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943708E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:17.650939 | finish at 2025-09-10 11:45:06 + [2025-09-09 23:02:54] iteration 3792/ 11920 | consumed samples: 3883008 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940823E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:15.208374 | finish at 2025-09-10 11:45:10 + [2025-09-09 23:03:00] iteration 3793/ 11920 | consumed samples: 3884032 | elapsed time per iteration (ms): 5861.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953552E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:52.994664 | finish at 2025-09-10 12:16:53 + [2025-09-09 23:03:06] iteration 3794/ 11920 | consumed samples: 3885056 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957794E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:42.817714 | finish at 2025-09-10 11:44:49 + [2025-09-09 23:03:12] iteration 3795/ 11920 | consumed samples: 3886080 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960606E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:13.117908 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:03:17] iteration 3796/ 11920 | consumed samples: 3887104 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959303E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:20.900399 | finish at 2025-09-10 11:45:38 + [2025-09-09 23:03:23] iteration 3797/ 11920 | consumed samples: 3888128 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972043E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:10.984197 | finish at 2025-09-10 11:45:34 + [2025-09-09 23:03:28] iteration 3798/ 11920 | consumed samples: 3889152 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962794E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:04.657266 | finish at 2025-09-10 11:45:33 + [2025-09-09 23:03:34] iteration 3799/ 11920 | consumed samples: 3890176 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952259E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:22.491494 | finish at 2025-09-10 11:44:57 + [2025-09-09 23:03:40] iteration 3800/ 11920 | consumed samples: 3891200 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964236E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:03.829517 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:03:45] iteration 3801/ 11920 | consumed samples: 3892224 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956598E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:58.708771 | finish at 2025-09-10 11:45:44 + [2025-09-09 23:03:51] iteration 3802/ 11920 | consumed samples: 3893248 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957061E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:39.424805 | finish at 2025-09-10 11:45:30 + [2025-09-09 23:03:57] iteration 3803/ 11920 | consumed samples: 3894272 | elapsed time per iteration (ms): 5655.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983379E+00 | loss scale: 1.0 | grad norm: 0.381 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:07.330190 | finish at 2025-09-10 
11:49:04 + [2025-09-09 23:04:02] iteration 3804/ 11920 | consumed samples: 3895296 | elapsed time per iteration (ms): 5652.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018375E+00 | loss scale: 1.0 | grad norm: 0.696 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:36.246585 | finish at 2025-09-10 11:48:39 + [2025-09-09 23:04:08] iteration 3805/ 11920 | consumed samples: 3896320 | elapsed time per iteration (ms): 5897.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031997E+00 | loss scale: 1.0 | grad norm: 0.419 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:36.258695 | finish at 2025-09-10 12:21:44 + [2025-09-09 23:04:14] iteration 3806/ 11920 | consumed samples: 3897344 | elapsed time per iteration (ms): 5658.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049075E+00 | loss scale: 1.0 | grad norm: 0.550 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:12.174892 | finish at 2025-09-10 11:49:26 + [2025-09-09 23:04:19] iteration 3807/ 11920 | consumed samples: 3898368 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035576E+00 | loss scale: 1.0 | grad norm: 0.442 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:45.398446 | finish at 2025-09-10 11:47:05 + [2025-09-09 23:04:25] iteration 3808/ 11920 | consumed samples: 3899392 | elapsed time per iteration (ms): 5648.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038403E+00 | loss 
scale: 1.0 | grad norm: 0.454 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:42.322083 | finish at 2025-09-10 11:48:07 + [2025-09-09 23:04:31] iteration 3809/ 11920 | consumed samples: 3900416 | elapsed time per iteration (ms): 5677.4 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035615E+00 | loss scale: 1.0 | grad norm: 0.477 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:29.693985 | finish at 2025-09-10 11:52:00 + [2025-09-09 23:04:36] iteration 3810/ 11920 | consumed samples: 3901440 | elapsed time per iteration (ms): 5667.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060657E+00 | loss scale: 1.0 | grad norm: 0.561 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:59.368515 | finish at 2025-09-10 11:50:36 + [2025-09-09 23:04:43] iteration 3811/ 11920 | consumed samples: 3902464 | elapsed time per iteration (ms): 6087.4 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053898E+00 | loss scale: 1.0 | grad norm: 0.405 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:42:42.596739 | finish at 2025-09-10 12:47:25 + [2025-09-09 23:04:48] iteration 3812/ 11920 | consumed samples: 3903488 | elapsed time per iteration (ms): 5651.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049073E+00 | loss scale: 1.0 | grad norm: 0.361 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:44.760866 | finish at 2025-09-10 11:48:33 + [2025-09-09 23:04:54] iteration 3813/ 11920 | consumed samples: 3904512 | elapsed time 
per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048892E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:01.202009 | finish at 2025-09-10 11:46:55 + [2025-09-09 23:04:59] iteration 3814/ 11920 | consumed samples: 3905536 | elapsed time per iteration (ms): 5646.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018328E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:49.592576 | finish at 2025-09-10 11:47:49 + [2025-09-09 23:05:05] iteration 3815/ 11920 | consumed samples: 3906560 | elapsed time per iteration (ms): 5646.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021331E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:48.382941 | finish at 2025-09-10 11:47:54 + [2025-09-09 23:05:11] iteration 3816/ 11920 | consumed samples: 3907584 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012028E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:06.393505 | finish at 2025-09-10 11:46:17 + [2025-09-09 23:05:17] iteration 3817/ 11920 | consumed samples: 3908608 | elapsed time per iteration (ms): 6038.9 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023051E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:35:33.536234 | finish at 2025-09-10 12:40:50 + [2025-09-09 23:05:23] iteration 3818/ 11920 | consumed samples: 3909632 | elapsed time per iteration (ms): 5951.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001240E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:40.769758 | finish at 2025-09-10 12:29:04 + [2025-09-09 23:05:28] iteration 3819/ 11920 | consumed samples: 3910656 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001870E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:40:03.964595 | finish at 2025-09-10 11:45:32 + [2025-09-09 23:05:34] iteration 3820/ 11920 | consumed samples: 3911680 | elapsed time per iteration (ms): 5963.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996542E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:01.675487 | finish at 2025-09-10 12:30:36 + [2025-09-09 23:05:40] iteration 3821/ 11920 | consumed samples: 3912704 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992089E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:39:48.594750 | finish at 2025-09-10 11:45:29 + [2025-09-09 23:05:46] iteration 3822/ 11920 | consumed samples: 3913728 | elapsed time per iteration (ms): 5854.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.000357E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:12.268752 | finish at 2025-09-10 12:15:58 + [2025-09-09 23:05:51] iteration 3823/ 11920 | consumed samples: 3914752 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000219E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:39:56.597268 | finish at 2025-09-10 11:45:48 + [2025-09-09 23:05:57] iteration 3824/ 11920 | consumed samples: 3915776 | elapsed time per iteration (ms): 5982.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983687E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:27:14.222176 | finish at 2025-09-10 12:33:12 + [2025-09-09 23:06:03] iteration 3825/ 11920 | consumed samples: 3916800 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968040E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:39:35.076736 | finish at 2025-09-10 11:45:38 + [2025-09-09 23:06:09] iteration 3826/ 11920 | consumed samples: 3917824 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978547E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:48.858067 | finish at 2025-09-10 11:44:58 + [2025-09-09 
23:06:14] iteration 3827/ 11920 | consumed samples: 3918848 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980909E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:29.479424 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:06:20] iteration 3828/ 11920 | consumed samples: 3919872 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975144E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:55.534939 | finish at 2025-09-10 11:45:15 + [2025-09-09 23:06:26] iteration 3829/ 11920 | consumed samples: 3920896 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971923E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:40:34.257455 | finish at 2025-09-10 11:47:00 + [2025-09-09 23:06:31] iteration 3830/ 11920 | consumed samples: 3921920 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984794E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:25.939462 | finish at 2025-09-10 11:44:57 + [2025-09-09 23:06:37] iteration 3831/ 11920 | consumed samples: 3922944 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963081E+00 | loss scale: 1.0 | grad norm: 
0.120 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:19.303931 | finish at 2025-09-10 11:44:56 + [2025-09-09 23:06:42] iteration 3832/ 11920 | consumed samples: 3923968 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964864E+00 | loss scale: 1.0 | grad norm: 0.108 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:37:54.368803 | finish at 2025-09-10 11:44:37 + [2025-09-09 23:06:48] iteration 3833/ 11920 | consumed samples: 3924992 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972305E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:39:46.354124 | finish at 2025-09-10 11:46:34 + [2025-09-09 23:06:54] iteration 3834/ 11920 | consumed samples: 3926016 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974779E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:26.386846 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:06:59] iteration 3835/ 11920 | consumed samples: 3927040 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966741E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:32.625439 | finish at 2025-09-10 11:45:32 + [2025-09-09 23:07:05] iteration 3836/ 11920 | consumed samples: 3928064 | elapsed time per iteration (ms): 
5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965919E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:37:43.213903 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:07:11] iteration 3837/ 11920 | consumed samples: 3929088 | elapsed time per iteration (ms): 5876.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959427E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:11:38.883337 | finish at 2025-09-10 12:18:50 + [2025-09-09 23:07:16] iteration 3838/ 11920 | consumed samples: 3930112 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972000E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:49.102334 | finish at 2025-09-10 11:44:06 + [2025-09-09 23:07:22] iteration 3839/ 11920 | consumed samples: 3931136 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968554E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:12.618809 | finish at 2025-09-10 11:45:35 + [2025-09-09 23:07:28] iteration 3840/ 11920 | consumed samples: 3932160 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949413E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:37:58.595810 | finish at 2025-09-10 11:45:26 + [2025-09-09 23:07:34] iteration 3841/ 11920 | consumed samples: 3933184 | elapsed time per iteration (ms): 5984.0 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964522E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:25:44.762705 | finish at 2025-09-10 12:33:18 + [2025-09-09 23:07:39] iteration 3842/ 11920 | consumed samples: 3934208 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963376E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:49.494891 | finish at 2025-09-10 11:44:29 + [2025-09-09 23:07:45] iteration 3843/ 11920 | consumed samples: 3935232 | elapsed time per iteration (ms): 5653.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972036E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:02.266785 | finish at 2025-09-10 11:48:47 + [2025-09-09 23:07:51] iteration 3844/ 11920 | consumed samples: 3936256 | elapsed time per iteration (ms): 5918.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959280E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:16:36.451964 | finish at 2025-09-10 12:24:27 + [2025-09-09 23:07:57] iteration 3845/ 11920 | consumed samples: 3937280 | elapsed time per iteration (ms): 5973.3 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.968517E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:54.314555 | finish at 2025-09-10 12:31:51 + [2025-09-09 23:08:03] iteration 3846/ 11920 | consumed samples: 3938304 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955891E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:04.519827 | finish at 2025-09-10 11:46:07 + [2025-09-09 23:08:08] iteration 3847/ 11920 | consumed samples: 3939328 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960163E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:41.644102 | finish at 2025-09-10 11:44:50 + [2025-09-09 23:08:14] iteration 3848/ 11920 | consumed samples: 3940352 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965318E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:37:07.295504 | finish at 2025-09-10 11:45:21 + [2025-09-09 23:08:19] iteration 3849/ 11920 | consumed samples: 3941376 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976043E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:12.433004 | finish at 2025-09-10 11:46:32 + [2025-09-09 23:08:25] iteration 
3850/ 11920 | consumed samples: 3942400 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975914E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:34.112041 | finish at 2025-09-10 11:46:59 + [2025-09-09 23:08:31] iteration 3851/ 11920 | consumed samples: 3943424 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959972E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:37:59.674771 | finish at 2025-09-10 11:46:30 + [2025-09-09 23:08:36] iteration 3852/ 11920 | consumed samples: 3944448 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954581E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:26.018193 | finish at 2025-09-10 11:45:02 + [2025-09-09 23:08:42] iteration 3853/ 11920 | consumed samples: 3945472 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950148E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:06.071697 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:08:48] iteration 3854/ 11920 | consumed samples: 3946496 | elapsed time per iteration (ms): 5979.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969741E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:51.108986 | finish at 2025-09-10 12:32:39 + [2025-09-09 23:08:54] iteration 3855/ 11920 | consumed samples: 3947520 | elapsed time per iteration (ms): 5646.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956479E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:38:54.743137 | finish at 2025-09-10 11:47:48 + [2025-09-09 23:08:59] iteration 3856/ 11920 | consumed samples: 3948544 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964647E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:37:01.042786 | finish at 2025-09-10 11:46:00 + [2025-09-09 23:09:05] iteration 3857/ 11920 | consumed samples: 3949568 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959158E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:21.695708 | finish at 2025-09-10 11:45:27 + [2025-09-09 23:09:10] iteration 3858/ 11920 | consumed samples: 3950592 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961964E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:12.442182 | finish at 2025-09-10 11:45:23 + [2025-09-09 23:09:16] iteration 3859/ 11920 | consumed samples: 3951616 | elapsed time per iteration (ms): 5622.7 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954434E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:24.303910 | finish at 2025-09-10 11:44:40 + [2025-09-09 23:09:22] iteration 3860/ 11920 | consumed samples: 3952640 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947368E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:28.750710 | finish at 2025-09-10 11:44:50 + [2025-09-09 23:09:27] iteration 3861/ 11920 | consumed samples: 3953664 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949670E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:42.000859 | finish at 2025-09-10 11:45:09 + [2025-09-09 23:09:33] iteration 3862/ 11920 | consumed samples: 3954688 | elapsed time per iteration (ms): 5868.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944165E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:08:10.355608 | finish at 2025-09-10 12:17:44 + [2025-09-09 23:09:39] iteration 3863/ 11920 | consumed samples: 3955712 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964313E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:35.812454 
| finish at 2025-09-10 11:46:15 + [2025-09-09 23:09:44] iteration 3864/ 11920 | consumed samples: 3956736 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946883E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:34.359907 | finish at 2025-09-10 11:44:19 + [2025-09-09 23:09:50] iteration 3865/ 11920 | consumed samples: 3957760 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951786E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:58.172010 | finish at 2025-09-10 11:45:48 + [2025-09-09 23:09:56] iteration 3866/ 11920 | consumed samples: 3958784 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956796E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:05.513980 | finish at 2025-09-10 11:46:01 + [2025-09-09 23:10:01] iteration 3867/ 11920 | consumed samples: 3959808 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949444E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:15.106791 | finish at 2025-09-10 11:46:16 + [2025-09-09 23:10:07] iteration 3868/ 11920 | consumed samples: 3960832 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.955469E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:55.830760 | finish at 2025-09-10 11:45:03 + [2025-09-09 23:10:13] iteration 3869/ 11920 | consumed samples: 3961856 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957246E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:32.373117 | finish at 2025-09-10 11:44:45 + [2025-09-09 23:10:18] iteration 3870/ 11920 | consumed samples: 3962880 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944569E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:13.721917 | finish at 2025-09-10 11:44:32 + [2025-09-09 23:10:24] iteration 3871/ 11920 | consumed samples: 3963904 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984769E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:34.639046 | finish at 2025-09-10 11:45:58 + [2025-09-09 23:10:29] iteration 3872/ 11920 | consumed samples: 3964928 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946772E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:22.319073 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:10:35] iteration 3873/ 11920 | consumed samples: 
3965952 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953355E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:19.515302 | finish at 2025-09-10 11:44:55 + [2025-09-09 23:10:41] iteration 3874/ 11920 | consumed samples: 3966976 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965477E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:32.368124 | finish at 2025-09-10 11:45:13 + [2025-09-09 23:10:46] iteration 3875/ 11920 | consumed samples: 3968000 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957576E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:37.329220 | finish at 2025-09-10 11:45:24 + [2025-09-09 23:10:52] iteration 3876/ 11920 | consumed samples: 3969024 | elapsed time per iteration (ms): 5951.1 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957020E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:50.883269 | finish at 2025-09-10 12:28:43 + [2025-09-09 23:10:58] iteration 3877/ 11920 | consumed samples: 3970048 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960599E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:00.055003 | finish at 2025-09-10 11:45:58 + [2025-09-09 23:11:04] iteration 3878/ 11920 | consumed samples: 3971072 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950903E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:04.251153 | finish at 2025-09-10 11:45:08 + [2025-09-09 23:11:09] iteration 3879/ 11920 | consumed samples: 3972096 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949403E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:36.150715 | finish at 2025-09-10 11:44:45 + [2025-09-09 23:11:15] iteration 3880/ 11920 | consumed samples: 3973120 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939023E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:37.667913 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:11:20] iteration 3881/ 11920 | consumed samples: 3974144 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955687E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:35.915452 | finish at 2025-09-10 11:44:56 + [2025-09-09 23:11:26] iteration 3882/ 11920 | consumed samples: 3975168 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952603E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:04.570764 | finish at 2025-09-10 11:44:31 + [2025-09-09 23:11:32] iteration 3883/ 11920 | consumed samples: 3976192 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953964E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:12.805220 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:11:37] iteration 3884/ 11920 | consumed samples: 3977216 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966239E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:10.918876 | finish at 2025-09-10 11:43:48 + [2025-09-09 23:11:43] iteration 3885/ 11920 | consumed samples: 3978240 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975291E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:12.721777 | finish at 2025-09-10 11:44:56 + [2025-09-09 23:11:49] iteration 3886/ 11920 | consumed samples: 3979264 | elapsed time per iteration (ms): 5935.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958896E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:44.021045 | finish at 2025-09-10 
12:26:33 + [2025-09-09 23:11:54] iteration 3887/ 11920 | consumed samples: 3980288 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944751E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:36.410288 | finish at 2025-09-10 11:44:31 + [2025-09-09 23:12:00] iteration 3888/ 11920 | consumed samples: 3981312 | elapsed time per iteration (ms): 5849.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945033E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:02:59.910828 | finish at 2025-09-10 12:15:00 + [2025-09-09 23:12:06] iteration 3889/ 11920 | consumed samples: 3982336 | elapsed time per iteration (ms): 6095.8 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965621E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:35:55.217917 | finish at 2025-09-10 12:48:02 + [2025-09-09 23:12:12] iteration 3890/ 11920 | consumed samples: 3983360 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966750E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:50.348141 | finish at 2025-09-10 11:44:02 + [2025-09-09 23:12:18] iteration 3891/ 11920 | consumed samples: 3984384 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940491E+00 | loss 
scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:14.780511 | finish at 2025-09-10 11:44:32 + [2025-09-09 23:12:23] iteration 3892/ 11920 | consumed samples: 3985408 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950663E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:02.625398 | finish at 2025-09-10 11:45:26 + [2025-09-09 23:12:29] iteration 3893/ 11920 | consumed samples: 3986432 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953392E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:15.860439 | finish at 2025-09-10 11:44:45 + [2025-09-09 23:12:35] iteration 3894/ 11920 | consumed samples: 3987456 | elapsed time per iteration (ms): 6203.2 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950791E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:49:47.029447 | finish at 2025-09-10 13:02:22 + [2025-09-09 23:12:41] iteration 3895/ 11920 | consumed samples: 3988480 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956630E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:40.173819 | finish at 2025-09-10 11:44:21 + [2025-09-09 23:12:46] iteration 3896/ 11920 | consumed samples: 3989504 | elapsed time 
per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945595E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:37.978256 | finish at 2025-09-10 11:44:24 + [2025-09-09 23:12:52] iteration 3897/ 11920 | consumed samples: 3990528 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957559E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:45.904358 | finish at 2025-09-10 11:45:38 + [2025-09-09 23:12:58] iteration 3898/ 11920 | consumed samples: 3991552 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959172E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:42.321280 | finish at 2025-09-10 11:45:40 + [2025-09-09 23:13:03] iteration 3899/ 11920 | consumed samples: 3992576 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943071E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:59.209306 | finish at 2025-09-10 11:45:02 + [2025-09-09 23:13:09] iteration 3900/ 11920 | consumed samples: 3993600 | elapsed time per iteration (ms): 6127.4 | throughput per GPU (TFLOP/s/GPU): 73.7 | MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951194E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 13:39:01.919460 | finish at 2025-09-10 12:52:11 + [2025-09-09 23:13:15] iteration 3901/ 11920 | consumed samples: 3994624 | elapsed time per iteration (ms): 5964.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951148E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:12.321016 | finish at 2025-09-10 12:30:28 + [2025-09-09 23:13:21] iteration 3902/ 11920 | consumed samples: 3995648 | elapsed time per iteration (ms): 5824.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945312E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:20.178334 | finish at 2025-09-10 12:11:41 + [2025-09-09 23:13:27] iteration 3903/ 11920 | consumed samples: 3996672 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950721E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:44.822665 | finish at 2025-09-10 11:45:12 + [2025-09-09 23:13:32] iteration 3904/ 11920 | consumed samples: 3997696 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959424E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:49.933434 | finish at 2025-09-10 11:45:22 + [2025-09-09 23:13:38] iteration 3905/ 11920 | consumed samples: 3998720 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.955886E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:31.602898 | finish at 2025-09-10 11:44:10 + [2025-09-09 23:13:44] iteration 3906/ 11920 | consumed samples: 3999744 | elapsed time per iteration (ms): 5915.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941685E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:03.188010 | finish at 2025-09-10 12:23:47 + [2025-09-09 23:13:50] iteration 3907/ 11920 | consumed samples: 4000768 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949764E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:21.307449 | finish at 2025-09-10 11:45:11 + [2025-09-09 23:13:55] iteration 3908/ 11920 | consumed samples: 4001792 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967900E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:21.095277 | finish at 2025-09-10 11:44:16 + [2025-09-09 23:14:01] iteration 3909/ 11920 | consumed samples: 4002816 | elapsed time per iteration (ms): 5960.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928271E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:15:46.461812 | finish at 2025-09-10 12:29:48 + [2025-09-09 
23:14:07] iteration 3910/ 11920 | consumed samples: 4003840 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944885E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:33.925223 | finish at 2025-09-10 11:44:41 + [2025-09-09 23:14:12] iteration 3911/ 11920 | consumed samples: 4004864 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951256E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:39.135572 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:14:18] iteration 3912/ 11920 | consumed samples: 4005888 | elapsed time per iteration (ms): 5996.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950132E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:20:22.169676 | finish at 2025-09-10 12:34:41 + [2025-09-09 23:14:24] iteration 3913/ 11920 | consumed samples: 4006912 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957840E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:25.882064 | finish at 2025-09-10 11:44:50 + [2025-09-09 23:14:30] iteration 3914/ 11920 | consumed samples: 4007936 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940081E+00 | loss scale: 1.0 | grad norm: 
0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:20.814767 | finish at 2025-09-10 11:45:50 + [2025-09-09 23:14:35] iteration 3915/ 11920 | consumed samples: 4008960 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934714E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:23.113172 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:14:41] iteration 3916/ 11920 | consumed samples: 4009984 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932239E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:55.390657 | finish at 2025-09-10 11:44:36 + [2025-09-09 23:14:47] iteration 3917/ 11920 | consumed samples: 4011008 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944038E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:54.092717 | finish at 2025-09-10 11:44:41 + [2025-09-09 23:14:52] iteration 3918/ 11920 | consumed samples: 4012032 | elapsed time per iteration (ms): 5825.7 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934010E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:57.228863 | finish at 2025-09-10 12:11:50 + [2025-09-09 23:14:58] iteration 3919/ 11920 | consumed samples: 4013056 | elapsed time per iteration (ms): 
5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949167E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:11.155095 | finish at 2025-09-10 11:45:09 + [2025-09-09 23:15:04] iteration 3920/ 11920 | consumed samples: 4014080 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938586E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:45.698166 | finish at 2025-09-10 11:45:49 + [2025-09-09 23:15:09] iteration 3921/ 11920 | consumed samples: 4015104 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959594E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:19.082018 | finish at 2025-09-10 11:44:28 + [2025-09-09 23:15:15] iteration 3922/ 11920 | consumed samples: 4016128 | elapsed time per iteration (ms): 5958.2 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938679E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:14:13.994243 | finish at 2025-09-10 12:29:29 + [2025-09-09 23:15:21] iteration 3923/ 11920 | consumed samples: 4017152 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954518E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:29:22.731649 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:15:26] iteration 3924/ 11920 | consumed samples: 4018176 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943833E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:15.328595 | finish at 2025-09-10 11:45:42 + [2025-09-09 23:15:32] iteration 3925/ 11920 | consumed samples: 4019200 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942664E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:38.323528 | finish at 2025-09-10 11:45:10 + [2025-09-09 23:15:38] iteration 3926/ 11920 | consumed samples: 4020224 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939112E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:15.393890 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:15:44] iteration 3927/ 11920 | consumed samples: 4021248 | elapsed time per iteration (ms): 5952.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937180E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:00.637305 | finish at 2025-09-10 12:28:44 + [2025-09-09 23:15:49] iteration 3928/ 11920 | consumed samples: 4022272 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.959857E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:04.343307 | finish at 2025-09-10 11:45:54 + [2025-09-09 23:15:55] iteration 3929/ 11920 | consumed samples: 4023296 | elapsed time per iteration (ms): 5988.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951082E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:17:33.807233 | finish at 2025-09-10 12:33:29 + [2025-09-09 23:16:01] iteration 3930/ 11920 | consumed samples: 4024320 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944966E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:53.475287 | finish at 2025-09-10 11:45:54 + [2025-09-09 23:16:07] iteration 3931/ 11920 | consumed samples: 4025344 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956918E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:40.851814 | finish at 2025-09-10 11:45:47 + [2025-09-09 23:16:12] iteration 3932/ 11920 | consumed samples: 4026368 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952375E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:14.567298 | finish at 2025-09-10 11:45:27 + [2025-09-09 23:16:18] iteration 
3933/ 11920 | consumed samples: 4027392 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944490E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:51.544219 | finish at 2025-09-10 11:45:09 + [2025-09-09 23:16:23] iteration 3934/ 11920 | consumed samples: 4028416 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968865E+00 | loss scale: 1.0 | grad norm: 0.283 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:20.079304 | finish at 2025-09-10 11:44:43 + [2025-09-09 23:16:29] iteration 3935/ 11920 | consumed samples: 4029440 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954065E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:59.657029 | finish at 2025-09-10 11:44:29 + [2025-09-09 23:16:35] iteration 3936/ 11920 | consumed samples: 4030464 | elapsed time per iteration (ms): 5851.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939653E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:34.835701 | finish at 2025-09-10 12:15:10 + [2025-09-09 23:16:40] iteration 3937/ 11920 | consumed samples: 4031488 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945891E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:25.511267 | finish at 2025-09-10 11:45:06 + [2025-09-09 23:16:46] iteration 3938/ 11920 | consumed samples: 4032512 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953697E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:51.081037 | finish at 2025-09-10 11:45:37 + [2025-09-09 23:16:52] iteration 3939/ 11920 | consumed samples: 4033536 | elapsed time per iteration (ms): 5983.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951024E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:15:51.303271 | finish at 2025-09-10 12:32:43 + [2025-09-09 23:16:58] iteration 3940/ 11920 | consumed samples: 4034560 | elapsed time per iteration (ms): 5866.5 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949727E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:14.669209 | finish at 2025-09-10 12:17:13 + [2025-09-09 23:17:04] iteration 3941/ 11920 | consumed samples: 4035584 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944491E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:58.509619 | finish at 2025-09-10 11:46:02 + [2025-09-09 23:17:09] iteration 3942/ 11920 | consumed samples: 4036608 | elapsed time per iteration (ms): 5630.9 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946566E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:43.414557 | finish at 2025-09-10 11:45:53 + [2025-09-09 23:17:15] iteration 3943/ 11920 | consumed samples: 4037632 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952764E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:59.982139 | finish at 2025-09-10 11:44:15 + [2025-09-09 23:17:20] iteration 3944/ 11920 | consumed samples: 4038656 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953571E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:46.939659 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:17:26] iteration 3945/ 11920 | consumed samples: 4039680 | elapsed time per iteration (ms): 5868.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946214E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:00.180846 | finish at 2025-09-10 12:17:27 + [2025-09-09 23:17:32] iteration 3946/ 11920 | consumed samples: 4040704 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936380E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:50.982021 
| finish at 2025-09-10 11:45:23 + [2025-09-09 23:17:38] iteration 3947/ 11920 | consumed samples: 4041728 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948146E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:20.791284 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:17:43] iteration 3948/ 11920 | consumed samples: 4042752 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949998E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:04.823742 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:17:49] iteration 3949/ 11920 | consumed samples: 4043776 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949666E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:17.553973 | finish at 2025-09-10 11:46:06 + [2025-09-09 23:17:54] iteration 3950/ 11920 | consumed samples: 4044800 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948115E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:44.381227 | finish at 2025-09-10 11:44:39 + [2025-09-09 23:18:00] iteration 3951/ 11920 | consumed samples: 4045824 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.945487E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:31.713319 | finish at 2025-09-10 11:45:32 + [2025-09-09 23:18:06] iteration 3952/ 11920 | consumed samples: 4046848 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960029E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:48.996391 | finish at 2025-09-10 11:46:55 + [2025-09-09 23:18:11] iteration 3953/ 11920 | consumed samples: 4047872 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939778E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:28.441388 | finish at 2025-09-10 11:44:40 + [2025-09-09 23:18:17] iteration 3954/ 11920 | consumed samples: 4048896 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962995E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:29.296926 | finish at 2025-09-10 11:45:46 + [2025-09-09 23:18:23] iteration 3955/ 11920 | consumed samples: 4049920 | elapsed time per iteration (ms): 5901.0 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942001E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:21.159443 | finish at 2025-09-10 12:21:44 + [2025-09-09 23:18:29] iteration 3956/ 11920 | consumed samples: 
4050944 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969842E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:48.293575 | finish at 2025-09-10 11:46:17 + [2025-09-09 23:18:34] iteration 3957/ 11920 | consumed samples: 4051968 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937581E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:04.488746 | finish at 2025-09-10 11:44:39 + [2025-09-09 23:18:40] iteration 3958/ 11920 | consumed samples: 4052992 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945410E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:19.294670 | finish at 2025-09-10 11:44:59 + [2025-09-09 23:18:45] iteration 3959/ 11920 | consumed samples: 4054016 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959185E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:42.224813 | finish at 2025-09-10 11:45:28 + [2025-09-09 23:18:51] iteration 3960/ 11920 | consumed samples: 4055040 | elapsed time per iteration (ms): 5958.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943338E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:30.680294 | finish at 2025-09-10 12:29:22 + [2025-09-09 23:18:57] iteration 3961/ 11920 | consumed samples: 4056064 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950167E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:10.044219 | finish at 2025-09-10 11:46:07 + [2025-09-09 23:19:03] iteration 3962/ 11920 | consumed samples: 4057088 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968240E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:20.423780 | finish at 2025-09-10 11:45:23 + [2025-09-09 23:19:08] iteration 3963/ 11920 | consumed samples: 4058112 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955432E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:36.417896 | finish at 2025-09-10 11:45:45 + [2025-09-09 23:19:14] iteration 3964/ 11920 | consumed samples: 4059136 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963347E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:11.544456 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:19:20] iteration 3965/ 11920 | consumed samples: 4060160 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941910E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:11.208632 | finish at 2025-09-10 11:45:31 + [2025-09-09 23:19:25] iteration 3966/ 11920 | consumed samples: 4061184 | elapsed time per iteration (ms): 5848.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940139E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:19.401481 | finish at 2025-09-10 12:14:45 + [2025-09-09 23:19:31] iteration 3967/ 11920 | consumed samples: 4062208 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943470E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:25:18.199446 | finish at 2025-09-10 11:44:49 + [2025-09-09 23:19:37] iteration 3968/ 11920 | consumed samples: 4063232 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954132E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:25:59.210201 | finish at 2025-09-10 11:45:36 + [2025-09-09 23:19:42] iteration 3969/ 11920 | consumed samples: 4064256 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957097E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:25:09.376490 | finish at 2025-09-10 
11:44:52 + [2025-09-09 23:19:48] iteration 3970/ 11920 | consumed samples: 4065280 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965378E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:35.478809 | finish at 2025-09-10 11:46:23 + [2025-09-09 23:19:54] iteration 3971/ 11920 | consumed samples: 4066304 | elapsed time per iteration (ms): 5861.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945420E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:32.066828 | finish at 2025-09-10 12:16:26 + [2025-09-09 23:19:59] iteration 3972/ 11920 | consumed samples: 4067328 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945417E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:25:47.383031 | finish at 2025-09-10 11:45:47 + [2025-09-09 23:20:05] iteration 3973/ 11920 | consumed samples: 4068352 | elapsed time per iteration (ms): 5966.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944625E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:15.389327 | finish at 2025-09-10 12:30:21 + [2025-09-09 23:20:11] iteration 3974/ 11920 | consumed samples: 4069376 | elapsed time per iteration (ms): 5854.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937083E+00 | loss 
scale: 1.0 | grad norm: 0.160 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:55:19.893435 | finish at 2025-09-10 12:15:31 + [2025-09-09 23:20:17] iteration 3975/ 11920 | consumed samples: 4070400 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939975E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:20.338067 | finish at 2025-09-10 11:44:37 + [2025-09-09 23:20:22] iteration 3976/ 11920 | consumed samples: 4071424 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938517E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:25:23.156467 | finish at 2025-09-10 11:45:46 + [2025-09-09 23:20:28] iteration 3977/ 11920 | consumed samples: 4072448 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946885E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:04.936997 | finish at 2025-09-10 11:44:33 + [2025-09-09 23:20:34] iteration 3978/ 11920 | consumed samples: 4073472 | elapsed time per iteration (ms): 5613.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929754E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:05.830063 | finish at 2025-09-10 11:43:39 + [2025-09-09 23:20:39] iteration 3979/ 11920 | consumed samples: 4074496 | elapsed time 
per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930902E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:19.378037 | finish at 2025-09-10 11:44:59 + [2025-09-09 23:20:45] iteration 3980/ 11920 | consumed samples: 4075520 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944973E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:20.656176 | finish at 2025-09-10 11:45:06 + [2025-09-09 23:20:51] iteration 3981/ 11920 | consumed samples: 4076544 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951253E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:46.171837 | finish at 2025-09-10 11:45:37 + [2025-09-09 23:20:56] iteration 3982/ 11920 | consumed samples: 4077568 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947966E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:48.177720 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:21:02] iteration 3983/ 11920 | consumed samples: 4078592 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946313E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 12:24:53.164071 | finish at 2025-09-10 11:45:55 + [2025-09-09 23:21:07] iteration 3984/ 11920 | consumed samples: 4079616 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946499E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:13.386536 | finish at 2025-09-10 11:45:21 + [2025-09-09 23:21:13] iteration 3985/ 11920 | consumed samples: 4080640 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952060E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:40.335571 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:21:19] iteration 3986/ 11920 | consumed samples: 4081664 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953844E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:42.080173 | finish at 2025-09-10 11:45:01 + [2025-09-09 23:21:24] iteration 3987/ 11920 | consumed samples: 4082688 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936660E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:42.593524 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:21:30] iteration 3988/ 11920 | consumed samples: 4083712 | elapsed time per iteration (ms): 5880.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.941338E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:21.420404 | finish at 2025-09-10 12:18:52 + [2025-09-09 23:21:36] iteration 3989/ 11920 | consumed samples: 4084736 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946562E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:48.034536 | finish at 2025-09-10 11:44:24 + [2025-09-09 23:21:41] iteration 3990/ 11920 | consumed samples: 4085760 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947189E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:46.192601 | finish at 2025-09-10 11:44:28 + [2025-09-09 23:21:47] iteration 3991/ 11920 | consumed samples: 4086784 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945008E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:42.066085 | finish at 2025-09-10 11:44:29 + [2025-09-09 23:21:53] iteration 3992/ 11920 | consumed samples: 4087808 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948606E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:11.187502 | finish at 2025-09-10 11:45:04 + [2025-09-09 
23:21:59] iteration 3993/ 11920 | consumed samples: 4088832 | elapsed time per iteration (ms): 6004.7 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947495E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:19.507830 | finish at 2025-09-10 12:35:18 + [2025-09-09 23:22:04] iteration 3994/ 11920 | consumed samples: 4089856 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927309E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:59.980037 | finish at 2025-09-10 11:45:04 + [2025-09-09 23:22:10] iteration 3995/ 11920 | consumed samples: 4090880 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920626E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:28.260081 | finish at 2025-09-10 11:44:38 + [2025-09-09 23:22:16] iteration 3996/ 11920 | consumed samples: 4091904 | elapsed time per iteration (ms): 5976.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940011E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:20.179925 | finish at 2025-09-10 12:31:36 + [2025-09-09 23:22:22] iteration 3997/ 11920 | consumed samples: 4092928 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956802E+00 | loss scale: 1.0 | grad norm: 
0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:43.924399 | finish at 2025-09-10 11:46:05 + [2025-09-09 23:22:27] iteration 3998/ 11920 | consumed samples: 4093952 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936294E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:33.530655 | finish at 2025-09-10 11:46:01 + [2025-09-09 23:22:33] iteration 3999/ 11920 | consumed samples: 4094976 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935033E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:59.097371 | finish at 2025-09-10 11:44:32 + [2025-09-09 23:22:38] iteration 4000/ 11920 | consumed samples: 4096000 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947603E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:21.738796 | finish at 2025-09-10 11:45:00 + [2025-09-09 23:22:44] iteration 4001/ 11920 | consumed samples: 4097024 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952293E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:18.051803 | finish at 2025-09-10 11:47:02 + [2025-09-09 23:22:50] iteration 4002/ 11920 | consumed samples: 4098048 | elapsed time per iteration (ms): 
5656.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935323E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:28.958788 | finish at 2025-09-10 11:49:19 + [2025-09-09 23:22:55] iteration 4003/ 11920 | consumed samples: 4099072 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918798E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:24:03.484964 | finish at 2025-09-10 11:46:59 + [2025-09-09 23:23:01] iteration 4004/ 11920 | consumed samples: 4100096 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942591E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:53.337214 | finish at 2025-09-10 11:46:54 + [2025-09-09 23:23:07] iteration 4005/ 11920 | consumed samples: 4101120 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947325E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:23:46.459030 | finish at 2025-09-10 11:46:53 + [2025-09-09 23:23:12] iteration 4006/ 11920 | consumed samples: 4102144 | elapsed time per iteration (ms): 5847.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926886E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:51:18.940258 | finish at 2025-09-10 12:14:31 + [2025-09-09 23:23:18] iteration 4007/ 11920 | consumed samples: 4103168 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947251E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:47.606430 | finish at 2025-09-10 11:45:06 + [2025-09-09 23:23:24] iteration 4008/ 11920 | consumed samples: 4104192 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952346E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:05.067181 | finish at 2025-09-10 11:45:29 + [2025-09-09 23:23:29] iteration 4009/ 11920 | consumed samples: 4105216 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956125E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:20.749472 | finish at 2025-09-10 11:44:50 + [2025-09-09 23:23:35] iteration 4010/ 11920 | consumed samples: 4106240 | elapsed time per iteration (ms): 5838.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946437E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:38.721910 | finish at 2025-09-10 12:13:14 + [2025-09-09 23:23:41] iteration 4011/ 11920 | consumed samples: 4107264 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.932803E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:03.634145 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:23:46] iteration 4012/ 11920 | consumed samples: 4108288 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944702E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:51.533492 | finish at 2025-09-10 11:45:38 + [2025-09-09 23:23:52] iteration 4013/ 11920 | consumed samples: 4109312 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947900E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:13.296424 | finish at 2025-09-10 11:46:05 + [2025-09-09 23:23:58] iteration 4014/ 11920 | consumed samples: 4110336 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953671E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:16.273358 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:24:03] iteration 4015/ 11920 | consumed samples: 4111360 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934258E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:54.388425 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:24:09] iteration 
4016/ 11920 | consumed samples: 4112384 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943471E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:51.293129 | finish at 2025-09-10 11:46:00 + [2025-09-09 23:24:15] iteration 4017/ 11920 | consumed samples: 4113408 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953214E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:27.149876 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:24:21] iteration 4018/ 11920 | consumed samples: 4114432 | elapsed time per iteration (ms): 5983.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947384E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:57.297056 | finish at 2025-09-10 12:32:18 + [2025-09-09 23:24:26] iteration 4019/ 11920 | consumed samples: 4115456 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948836E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:52.443887 | finish at 2025-09-10 11:45:19 + [2025-09-09 23:24:32] iteration 4020/ 11920 | consumed samples: 4116480 | elapsed time per iteration (ms): 5954.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941068E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:03:57.019992 | finish at 2025-09-10 12:28:29 + [2025-09-09 23:24:38] iteration 4021/ 11920 | consumed samples: 4117504 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948042E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:35.790315 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:24:44] iteration 4022/ 11920 | consumed samples: 4118528 | elapsed time per iteration (ms): 5982.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948401E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:07:29.211278 | finish at 2025-09-10 12:32:13 + [2025-09-09 23:24:49] iteration 4023/ 11920 | consumed samples: 4119552 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928576E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:18.833383 | finish at 2025-09-10 11:46:08 + [2025-09-09 23:24:55] iteration 4024/ 11920 | consumed samples: 4120576 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934116E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:52.588205 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:25:01] iteration 4025/ 11920 | consumed samples: 4121600 | elapsed time per iteration (ms): 5631.9 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939676E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:03.792717 | finish at 2025-09-10 11:46:04 + [2025-09-09 23:25:06] iteration 4026/ 11920 | consumed samples: 4122624 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934480E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:57.973908 | finish at 2025-09-10 11:45:04 + [2025-09-09 23:25:12] iteration 4027/ 11920 | consumed samples: 4123648 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931037E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:03.269945 | finish at 2025-09-10 11:45:15 + [2025-09-09 23:25:17] iteration 4028/ 11920 | consumed samples: 4124672 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937339E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:34.462988 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:25:23] iteration 4029/ 11920 | consumed samples: 4125696 | elapsed time per iteration (ms): 5981.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941453E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:42.378782 
| finish at 2025-09-10 12:32:06 + [2025-09-09 23:25:29] iteration 4030/ 11920 | consumed samples: 4126720 | elapsed time per iteration (ms): 5831.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925783E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:47.228408 | finish at 2025-09-10 12:12:17 + [2025-09-09 23:25:35] iteration 4031/ 11920 | consumed samples: 4127744 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934520E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:10.915825 | finish at 2025-09-10 11:44:46 + [2025-09-09 23:25:41] iteration 4032/ 11920 | consumed samples: 4128768 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942610E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:44.038898 | finish at 2025-09-10 11:44:25 + [2025-09-09 23:25:46] iteration 4033/ 11920 | consumed samples: 4129792 | elapsed time per iteration (ms): 5950.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939430E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:02:13.088514 | finish at 2025-09-10 12:28:00 + [2025-09-09 23:25:52] iteration 4034/ 11920 | consumed samples: 4130816 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.932781E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:05.366908 | finish at 2025-09-10 11:45:57 + [2025-09-09 23:25:58] iteration 4035/ 11920 | consumed samples: 4131840 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932967E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:39.146998 | finish at 2025-09-10 11:45:37 + [2025-09-09 23:26:03] iteration 4036/ 11920 | consumed samples: 4132864 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932309E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:42.236938 | finish at 2025-09-10 11:44:46 + [2025-09-09 23:26:09] iteration 4037/ 11920 | consumed samples: 4133888 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945692E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:01.172084 | finish at 2025-09-10 11:45:10 + [2025-09-09 23:26:15] iteration 4038/ 11920 | consumed samples: 4134912 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947575E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:32.770669 | finish at 2025-09-10 11:45:47 + [2025-09-09 23:26:20] iteration 4039/ 11920 | consumed samples: 
4135936 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938771E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:55.608559 | finish at 2025-09-10 11:44:16 + [2025-09-09 23:26:26] iteration 4040/ 11920 | consumed samples: 4136960 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941374E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:47.615204 | finish at 2025-09-10 11:45:13 + [2025-09-09 23:26:32] iteration 4041/ 11920 | consumed samples: 4137984 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943937E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:25.565436 | finish at 2025-09-10 11:45:57 + [2025-09-09 23:26:37] iteration 4042/ 11920 | consumed samples: 4139008 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930213E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:19.465821 | finish at 2025-09-10 11:44:57 + [2025-09-09 23:26:43] iteration 4043/ 11920 | consumed samples: 4140032 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928456E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:40.633596 | finish at 2025-09-10 11:46:23 + [2025-09-09 23:26:49] iteration 4044/ 11920 | consumed samples: 4141056 | elapsed time per iteration (ms): 5946.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929839E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:00:35.505788 | finish at 2025-09-10 12:27:24 + [2025-09-09 23:26:54] iteration 4045/ 11920 | consumed samples: 4142080 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948309E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:53.330569 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:27:00] iteration 4046/ 11920 | consumed samples: 4143104 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969275E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:59.061277 | finish at 2025-09-10 11:45:59 + [2025-09-09 23:27:06] iteration 4047/ 11920 | consumed samples: 4144128 | elapsed time per iteration (ms): 5836.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942150E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:52.008441 | finish at 2025-09-10 12:12:58 + [2025-09-09 23:27:12] iteration 4048/ 11920 | consumed samples: 4145152 | elapsed time per iteration (ms): 6540.3 | throughput per GPU (TFLOP/s/GPU): 69.0 
| MFU 6.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953684E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 14:18:05.569199 | finish at 2025-09-10 13:45:18 + [2025-09-09 23:27:18] iteration 4049/ 11920 | consumed samples: 4146176 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949295E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:32.407903 | finish at 2025-09-10 11:45:50 + [2025-09-09 23:27:24] iteration 4050/ 11920 | consumed samples: 4147200 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961526E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:29.386024 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:27:29] iteration 4051/ 11920 | consumed samples: 4148224 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956253E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:18.026323 | finish at 2025-09-10 11:44:47 + [2025-09-09 23:27:35] iteration 4052/ 11920 | consumed samples: 4149248 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934581E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:43.435276 | finish at 2025-09-10 
11:45:18 + [2025-09-09 23:27:40] iteration 4053/ 11920 | consumed samples: 4150272 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950819E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:17.505730 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:27:46] iteration 4054/ 11920 | consumed samples: 4151296 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937071E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:23.417835 | finish at 2025-09-10 11:46:10 + [2025-09-09 23:27:52] iteration 4055/ 11920 | consumed samples: 4152320 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945929E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:37.716665 | finish at 2025-09-10 11:46:29 + [2025-09-09 23:27:58] iteration 4056/ 11920 | consumed samples: 4153344 | elapsed time per iteration (ms): 6280.0 | throughput per GPU (TFLOP/s/GPU): 71.9 | MFU 7.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957314E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:43:05.876652 | finish at 2025-09-10 13:11:04 + [2025-09-09 23:28:04] iteration 4057/ 11920 | consumed samples: 4154368 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942987E+00 | loss 
scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:52.097203 | finish at 2025-09-10 11:45:56 + [2025-09-09 23:28:09] iteration 4058/ 11920 | consumed samples: 4155392 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935676E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:13.363208 | finish at 2025-09-10 11:46:23 + [2025-09-09 23:28:15] iteration 4059/ 11920 | consumed samples: 4156416 | elapsed time per iteration (ms): 5924.4 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940813E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:11.628526 | finish at 2025-09-10 12:24:27 + [2025-09-09 23:28:21] iteration 4060/ 11920 | consumed samples: 4157440 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916064E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:25.817313 | finish at 2025-09-10 11:45:47 + [2025-09-09 23:28:26] iteration 4061/ 11920 | consumed samples: 4158464 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935600E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:16:01.009798 | finish at 2025-09-10 11:44:27 + [2025-09-09 23:28:32] iteration 4062/ 11920 | consumed samples: 4159488 | elapsed time 
per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935545E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:59.261271 | finish at 2025-09-10 11:44:31 + [2025-09-09 23:28:38] iteration 4063/ 11920 | consumed samples: 4160512 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945469E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:16:35.866650 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:28:44] iteration 4064/ 11920 | consumed samples: 4161536 | elapsed time per iteration (ms): 5833.0 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932663E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:43.948296 | finish at 2025-09-10 12:12:27 + [2025-09-09 23:28:49] iteration 4065/ 11920 | consumed samples: 4162560 | elapsed time per iteration (ms): 5931.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950712E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:31.211151 | finish at 2025-09-10 12:25:21 + [2025-09-09 23:28:55] iteration 4066/ 11920 | consumed samples: 4163584 | elapsed time per iteration (ms): 5930.9 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943557E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 12:56:21.517811 | finish at 2025-09-10 12:25:17 + [2025-09-09 23:29:01] iteration 4067/ 11920 | consumed samples: 4164608 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933283E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:16:19.526397 | finish at 2025-09-10 11:45:21 + [2025-09-09 23:29:07] iteration 4068/ 11920 | consumed samples: 4165632 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935582E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:06.626138 | finish at 2025-09-10 11:44:13 + [2025-09-09 23:29:12] iteration 4069/ 11920 | consumed samples: 4166656 | elapsed time per iteration (ms): 5856.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947453E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:21.022386 | finish at 2025-09-10 12:15:34 + [2025-09-09 23:29:18] iteration 4070/ 11920 | consumed samples: 4167680 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930828E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:22.383654 | finish at 2025-09-10 11:44:40 + [2025-09-09 23:29:24] iteration 4071/ 11920 | consumed samples: 4168704 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.928878E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:34.125330 | finish at 2025-09-10 11:44:58 + [2025-09-09 23:29:29] iteration 4072/ 11920 | consumed samples: 4169728 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948118E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:16:49.055546 | finish at 2025-09-10 11:46:18 + [2025-09-09 23:29:35] iteration 4073/ 11920 | consumed samples: 4170752 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936657E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:01.038991 | finish at 2025-09-10 11:44:36 + [2025-09-09 23:29:41] iteration 4074/ 11920 | consumed samples: 4171776 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934693E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:12.886839 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:29:46] iteration 4075/ 11920 | consumed samples: 4172800 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939365E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:00.430080 | finish at 2025-09-10 11:44:47 + [2025-09-09 
23:29:52] iteration 4076/ 11920 | consumed samples: 4173824 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925657E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:16:28.606251 | finish at 2025-09-10 11:46:20 + [2025-09-09 23:29:57] iteration 4077/ 11920 | consumed samples: 4174848 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935846E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:14:30.682442 | finish at 2025-09-10 11:44:28 + [2025-09-09 23:30:03] iteration 4078/ 11920 | consumed samples: 4175872 | elapsed time per iteration (ms): 5953.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939111E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:58:07.730538 | finish at 2025-09-10 12:28:11 + [2025-09-09 23:30:09] iteration 4079/ 11920 | consumed samples: 4176896 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935157E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:14:22.741914 | finish at 2025-09-10 11:44:32 + [2025-09-09 23:30:15] iteration 4080/ 11920 | consumed samples: 4177920 | elapsed time per iteration (ms): 6016.5 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930592E+00 | loss scale: 1.0 | grad norm: 
0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:06:09.288940 | finish at 2025-09-10 12:36:24 + [2025-09-09 23:30:21] iteration 4081/ 11920 | consumed samples: 4178944 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924860E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:56.682497 | finish at 2025-09-10 11:46:17 + [2025-09-09 23:30:27] iteration 4082/ 11920 | consumed samples: 4179968 | elapsed time per iteration (ms): 5840.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953863E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:55.563805 | finish at 2025-09-10 12:13:22 + [2025-09-09 23:30:32] iteration 4083/ 11920 | consumed samples: 4180992 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930541E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:13.977448 | finish at 2025-09-10 11:45:46 + [2025-09-09 23:30:38] iteration 4084/ 11920 | consumed samples: 4182016 | elapsed time per iteration (ms): 5966.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926199E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:59:12.803092 | finish at 2025-09-10 12:29:51 + [2025-09-09 23:30:44] iteration 4085/ 11920 | consumed samples: 4183040 | elapsed time per iteration (ms): 
5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932347E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:40.781202 | finish at 2025-09-10 11:44:25 + [2025-09-09 23:30:49] iteration 4086/ 11920 | consumed samples: 4184064 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937329E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:35.428508 | finish at 2025-09-10 11:46:25 + [2025-09-09 23:30:55] iteration 4087/ 11920 | consumed samples: 4185088 | elapsed time per iteration (ms): 5852.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945708E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:02.275502 | finish at 2025-09-10 12:14:58 + [2025-09-09 23:31:01] iteration 4088/ 11920 | consumed samples: 4186112 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936311E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:07.372004 | finish at 2025-09-10 11:46:08 + [2025-09-09 23:31:06] iteration 4089/ 11920 | consumed samples: 4187136 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932306E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:13:04.166206 | finish at 2025-09-10 11:44:11 + [2025-09-09 23:31:12] iteration 4090/ 11920 | consumed samples: 4188160 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921372E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:08.186045 | finish at 2025-09-10 11:44:20 + [2025-09-09 23:31:18] iteration 4091/ 11920 | consumed samples: 4189184 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957097E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:31.753971 | finish at 2025-09-10 11:44:49 + [2025-09-09 23:31:24] iteration 4092/ 11920 | consumed samples: 4190208 | elapsed time per iteration (ms): 5888.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930261E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:14.466730 | finish at 2025-09-10 12:19:38 + [2025-09-09 23:31:29] iteration 4093/ 11920 | consumed samples: 4191232 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941504E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:40.862122 | finish at 2025-09-10 11:47:10 + [2025-09-09 23:31:35] iteration 4094/ 11920 | consumed samples: 4192256 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.931365E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:37.087258 | finish at 2025-09-10 11:45:12 + [2025-09-09 23:31:41] iteration 4095/ 11920 | consumed samples: 4193280 | elapsed time per iteration (ms): 5961.0 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939208E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:24.858313 | finish at 2025-09-10 12:29:06 + [2025-09-09 23:31:46] iteration 4096/ 11920 | consumed samples: 4194304 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939642E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:54.968204 | finish at 2025-09-10 11:45:41 + [2025-09-09 23:31:52] iteration 4097/ 11920 | consumed samples: 4195328 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938502E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:52.462272 | finish at 2025-09-10 11:45:45 + [2025-09-09 23:31:58] iteration 4098/ 11920 | consumed samples: 4196352 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922608E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:44.148211 | finish at 2025-09-10 11:45:42 + [2025-09-09 23:32:04] iteration 
4099/ 11920 | consumed samples: 4197376 | elapsed time per iteration (ms): 5879.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942141E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:46:21.998760 | finish at 2025-09-10 12:18:26 + [2025-09-09 23:32:09] iteration 4100/ 11920 | consumed samples: 4198400 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956228E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:35.103607 | finish at 2025-09-10 11:44:44 + [2025-09-09 23:32:15] iteration 4101/ 11920 | consumed samples: 4199424 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931642E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:34.135780 | finish at 2025-09-10 11:44:49 + [2025-09-09 23:32:20] iteration 4102/ 11920 | consumed samples: 4200448 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930803E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:46.063478 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:32:26] iteration 4103/ 11920 | consumed samples: 4201472 | elapsed time per iteration (ms): 5661.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917285E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:33.966053 | finish at 2025-09-10 11:50:00 + [2025-09-09 23:32:32] iteration 4104/ 11920 | consumed samples: 4202496 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931983E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:28.325901 | finish at 2025-09-10 11:46:00 + [2025-09-09 23:32:37] iteration 4105/ 11920 | consumed samples: 4203520 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935214E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:45.847900 | finish at 2025-09-10 11:45:23 + [2025-09-09 23:32:43] iteration 4106/ 11920 | consumed samples: 4204544 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936602E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:54.583960 | finish at 2025-09-10 11:45:38 + [2025-09-09 23:32:49] iteration 4107/ 11920 | consumed samples: 4205568 | elapsed time per iteration (ms): 5902.4 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937558E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:35.766138 | finish at 2025-09-10 12:21:25 + [2025-09-09 23:32:55] iteration 4108/ 11920 | consumed samples: 4206592 | elapsed time per iteration (ms): 5625.9 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934121E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:29.860703 | finish at 2025-09-10 11:45:24 + [2025-09-09 23:33:01] iteration 4109/ 11920 | consumed samples: 4207616 | elapsed time per iteration (ms): 5972.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938703E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:27.333924 | finish at 2025-09-10 12:30:28 + [2025-09-09 23:33:06] iteration 4110/ 11920 | consumed samples: 4208640 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932671E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:18.722403 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:33:12] iteration 4111/ 11920 | consumed samples: 4209664 | elapsed time per iteration (ms): 6067.4 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928947E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:40.257546 | finish at 2025-09-10 12:42:52 + [2025-09-09 23:33:18] iteration 4112/ 11920 | consumed samples: 4210688 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944024E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:08.211395 
| finish at 2025-09-10 11:45:26 + [2025-09-09 23:33:23] iteration 4113/ 11920 | consumed samples: 4211712 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942303E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:12.409464 | finish at 2025-09-10 11:45:36 + [2025-09-09 23:33:29] iteration 4114/ 11920 | consumed samples: 4212736 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939699E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:36.183884 | finish at 2025-09-10 11:45:05 + [2025-09-09 23:33:35] iteration 4115/ 11920 | consumed samples: 4213760 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947204E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:48.854579 | finish at 2025-09-10 11:45:24 + [2025-09-09 23:33:40] iteration 4116/ 11920 | consumed samples: 4214784 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944864E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:37.111131 | finish at 2025-09-10 11:45:17 + [2025-09-09 23:33:46] iteration 4117/ 11920 | consumed samples: 4215808 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.945007E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:47.412895 | finish at 2025-09-10 11:45:33 + [2025-09-09 23:33:52] iteration 4118/ 11920 | consumed samples: 4216832 | elapsed time per iteration (ms): 5846.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942969E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:40:10.694413 | finish at 2025-09-10 12:14:03 + [2025-09-09 23:33:57] iteration 4119/ 11920 | consumed samples: 4217856 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938362E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:24.889758 | finish at 2025-09-10 11:45:22 + [2025-09-09 23:34:03] iteration 4120/ 11920 | consumed samples: 4218880 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944491E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:19.842796 | finish at 2025-09-10 11:46:23 + [2025-09-09 23:34:09] iteration 4121/ 11920 | consumed samples: 4219904 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932999E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:31.206552 | finish at 2025-09-10 11:44:40 + [2025-09-09 23:34:14] iteration 4122/ 11920 | consumed samples: 
4220928 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933651E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:00.306784 | finish at 2025-09-10 11:45:15 + [2025-09-09 23:34:20] iteration 4123/ 11920 | consumed samples: 4221952 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952481E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:15.416950 | finish at 2025-09-10 11:45:35 + [2025-09-09 23:34:26] iteration 4124/ 11920 | consumed samples: 4222976 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935758E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:25.783141 | finish at 2025-09-10 11:46:51 + [2025-09-09 23:34:31] iteration 4125/ 11920 | consumed samples: 4224000 | elapsed time per iteration (ms): 5853.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928501E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:40:28.044647 | finish at 2025-09-10 12:14:59 + [2025-09-09 23:34:37] iteration 4126/ 11920 | consumed samples: 4225024 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955562E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:42.799767 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:34:43] iteration 4127/ 11920 | consumed samples: 4226048 | elapsed time per iteration (ms): 5930.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938247E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:19.071542 | finish at 2025-09-10 12:25:02 + [2025-09-09 23:34:49] iteration 4128/ 11920 | consumed samples: 4227072 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938074E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:57.164131 | finish at 2025-09-10 11:45:46 + [2025-09-09 23:34:54] iteration 4129/ 11920 | consumed samples: 4228096 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935844E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:21.685586 | finish at 2025-09-10 11:44:16 + [2025-09-09 23:35:00] iteration 4130/ 11920 | consumed samples: 4229120 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924739E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:18.651564 | finish at 2025-09-10 11:45:19 + [2025-09-09 23:35:05] iteration 4131/ 11920 | consumed samples: 4230144 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926878E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:11.605939 | finish at 2025-09-10 11:45:17 + [2025-09-09 23:35:11] iteration 4132/ 11920 | consumed samples: 4231168 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930207E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:11.613916 | finish at 2025-09-10 11:44:23 + [2025-09-09 23:35:17] iteration 4133/ 11920 | consumed samples: 4232192 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940307E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:33.902128 | finish at 2025-09-10 11:44:51 + [2025-09-09 23:35:22] iteration 4134/ 11920 | consumed samples: 4233216 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935656E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:11.248919 | finish at 2025-09-10 11:44:34 + [2025-09-09 23:35:28] iteration 4135/ 11920 | consumed samples: 4234240 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942274E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:02.000967 | finish at 2025-09-10 
11:45:30 + [2025-09-09 23:35:34] iteration 4136/ 11920 | consumed samples: 4235264 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934374E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:20.398849 | finish at 2025-09-10 11:44:54 + [2025-09-09 23:35:39] iteration 4137/ 11920 | consumed samples: 4236288 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947712E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:41.039484 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:35:45] iteration 4138/ 11920 | consumed samples: 4237312 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952016E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:55.076597 | finish at 2025-09-10 11:44:40 + [2025-09-09 23:35:50] iteration 4139/ 11920 | consumed samples: 4238336 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936305E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:48.950745 | finish at 2025-09-10 11:45:39 + [2025-09-09 23:35:56] iteration 4140/ 11920 | consumed samples: 4239360 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928701E+00 | loss 
scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:38.201699 | finish at 2025-09-10 11:45:34 + [2025-09-09 23:36:02] iteration 4141/ 11920 | consumed samples: 4240384 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938031E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:53.973727 | finish at 2025-09-10 11:45:56 + [2025-09-09 23:36:07] iteration 4142/ 11920 | consumed samples: 4241408 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945525E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:12.603977 | finish at 2025-09-10 11:44:20 + [2025-09-09 23:36:13] iteration 4143/ 11920 | consumed samples: 4242432 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941141E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:29.132858 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:36:19] iteration 4144/ 11920 | consumed samples: 4243456 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933567E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:07.224815 | finish at 2025-09-10 11:45:26 + [2025-09-09 23:36:24] iteration 4145/ 11920 | consumed samples: 4244480 | elapsed time 
per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942646E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:50.188187 | finish at 2025-09-10 11:46:14 + [2025-09-09 23:36:30] iteration 4146/ 11920 | consumed samples: 4245504 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934096E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:54.542081 | finish at 2025-09-10 11:45:24 + [2025-09-09 23:36:35] iteration 4147/ 11920 | consumed samples: 4246528 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934829E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:09.887460 | finish at 2025-09-10 11:45:45 + [2025-09-09 23:36:41] iteration 4148/ 11920 | consumed samples: 4247552 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920436E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:37.242435 | finish at 2025-09-10 11:45:18 + [2025-09-09 23:36:47] iteration 4149/ 11920 | consumed samples: 4248576 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933509E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 12:07:55.416570 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:36:53] iteration 4150/ 11920 | consumed samples: 4249600 | elapsed time per iteration (ms): 5865.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935856E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:39:35.688765 | finish at 2025-09-10 12:16:28 + [2025-09-09 23:36:58] iteration 4151/ 11920 | consumed samples: 4250624 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932579E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:23.651616 | finish at 2025-09-10 11:45:22 + [2025-09-09 23:37:04] iteration 4152/ 11920 | consumed samples: 4251648 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924017E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:40.838537 | finish at 2025-09-10 11:47:45 + [2025-09-09 23:37:09] iteration 4153/ 11920 | consumed samples: 4252672 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929982E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:38.859313 | finish at 2025-09-10 11:45:48 + [2025-09-09 23:37:15] iteration 4154/ 11920 | consumed samples: 4253696 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.932672E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:58.869291 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:37:21] iteration 4155/ 11920 | consumed samples: 4254720 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929505E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:13.094214 | finish at 2025-09-10 11:46:34 + [2025-09-09 23:37:26] iteration 4156/ 11920 | consumed samples: 4255744 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927318E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:35.624654 | finish at 2025-09-10 11:46:02 + [2025-09-09 23:37:32] iteration 4157/ 11920 | consumed samples: 4256768 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931634E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:36.169079 | finish at 2025-09-10 11:44:08 + [2025-09-09 23:37:38] iteration 4158/ 11920 | consumed samples: 4257792 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941578E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:44.462650 | finish at 2025-09-10 11:45:22 + [2025-09-09 
23:37:43] iteration 4159/ 11920 | consumed samples: 4258816 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927391E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:08.265480 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:37:49] iteration 4160/ 11920 | consumed samples: 4259840 | elapsed time per iteration (ms): 5640.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910362E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:30.711613 | finish at 2025-09-10 11:47:20 + [2025-09-09 23:37:55] iteration 4161/ 11920 | consumed samples: 4260864 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934018E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:25.305495 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:38:00] iteration 4162/ 11920 | consumed samples: 4261888 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925220E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:24.718825 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:38:06] iteration 4163/ 11920 | consumed samples: 4262912 | elapsed time per iteration (ms): 5835.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932385E+00 | loss scale: 1.0 | grad norm: 
0.209 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:23.928064 | finish at 2025-09-10 12:12:30 + [2025-09-09 23:38:12] iteration 4164/ 11920 | consumed samples: 4263936 | elapsed time per iteration (ms): 6226.3 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919896E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:24:51.489772 | finish at 2025-09-10 13:03:04 + [2025-09-09 23:38:18] iteration 4165/ 11920 | consumed samples: 4264960 | elapsed time per iteration (ms): 5918.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928419E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:00.178413 | finish at 2025-09-10 12:23:18 + [2025-09-09 23:38:24] iteration 4166/ 11920 | consumed samples: 4265984 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926964E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:00.293619 | finish at 2025-09-10 11:44:24 + [2025-09-09 23:38:29] iteration 4167/ 11920 | consumed samples: 4267008 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927326E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:06.104985 | finish at 2025-09-10 11:46:35 + [2025-09-09 23:38:35] iteration 4168/ 11920 | consumed samples: 4268032 | elapsed time per iteration (ms): 
5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923747E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:50.702402 | finish at 2025-09-10 11:46:26 + [2025-09-09 23:38:41] iteration 4169/ 11920 | consumed samples: 4269056 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935814E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:09.664987 | finish at 2025-09-10 11:44:50 + [2025-09-09 23:38:46] iteration 4170/ 11920 | consumed samples: 4270080 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934833E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:55.658758 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:38:52] iteration 4171/ 11920 | consumed samples: 4271104 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933261E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:59.801921 | finish at 2025-09-10 11:46:52 + [2025-09-09 23:38:58] iteration 4172/ 11920 | consumed samples: 4272128 | elapsed time per iteration (ms): 5840.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947636E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:34:08.566869 | finish at 2025-09-10 12:13:06 + [2025-09-09 23:39:04] iteration 4173/ 11920 | consumed samples: 4273152 | elapsed time per iteration (ms): 5944.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944241E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:33.001719 | finish at 2025-09-10 12:26:37 + [2025-09-09 23:39:09] iteration 4174/ 11920 | consumed samples: 4274176 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950562E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:47.092136 | finish at 2025-09-10 11:44:56 + [2025-09-09 23:39:15] iteration 4175/ 11920 | consumed samples: 4275200 | elapsed time per iteration (ms): 6116.0 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.46% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923630E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:28.209006 | finish at 2025-09-10 12:48:44 + [2025-09-09 23:39:21] iteration 4176/ 11920 | consumed samples: 4276224 | elapsed time per iteration (ms): 5912.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930187E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:08.040634 | finish at 2025-09-10 12:22:29 + [2025-09-09 23:39:27] iteration 4177/ 11920 | consumed samples: 4277248 | elapsed time per iteration (ms): 5973.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.935344E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:55.876934 | finish at 2025-09-10 12:30:23 + [2025-09-09 23:39:33] iteration 4178/ 11920 | consumed samples: 4278272 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940711E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:26.689326 | finish at 2025-09-10 11:46:00 + [2025-09-09 23:39:39] iteration 4179/ 11920 | consumed samples: 4279296 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936724E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:47.943857 | finish at 2025-09-10 11:45:26 + [2025-09-09 23:39:44] iteration 4180/ 11920 | consumed samples: 4280320 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931233E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:33.462353 | finish at 2025-09-10 11:45:18 + [2025-09-09 23:39:50] iteration 4181/ 11920 | consumed samples: 4281344 | elapsed time per iteration (ms): 6124.2 | throughput per GPU (TFLOP/s/GPU): 73.7 | MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938702E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:09:55.444780 | finish at 2025-09-10 12:49:46 + [2025-09-09 23:39:56] iteration 
4182/ 11920 | consumed samples: 4282368 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927617E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:58.012221 | finish at 2025-09-10 11:44:54 + [2025-09-09 23:40:02] iteration 4183/ 11920 | consumed samples: 4283392 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937036E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:31.344229 | finish at 2025-09-10 11:45:33 + [2025-09-09 23:40:07] iteration 4184/ 11920 | consumed samples: 4284416 | elapsed time per iteration (ms): 5821.6 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925005E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:35.755274 | finish at 2025-09-10 12:10:43 + [2025-09-09 23:40:13] iteration 4185/ 11920 | consumed samples: 4285440 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932841E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:32.390209 | finish at 2025-09-10 11:46:45 + [2025-09-09 23:40:19] iteration 4186/ 11920 | consumed samples: 4286464 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935765E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:30.416905 | finish at 2025-09-10 11:45:49 + [2025-09-09 23:40:24] iteration 4187/ 11920 | consumed samples: 4287488 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922563E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:17.788731 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:40:30] iteration 4188/ 11920 | consumed samples: 4288512 | elapsed time per iteration (ms): 6157.6 | throughput per GPU (TFLOP/s/GPU): 73.3 | MFU 7.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934248E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:13:30.595810 | finish at 2025-09-10 12:54:01 + [2025-09-09 23:40:36] iteration 4189/ 11920 | consumed samples: 4289536 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931508E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:43.413444 | finish at 2025-09-10 11:45:19 + [2025-09-09 23:40:42] iteration 4190/ 11920 | consumed samples: 4290560 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925444E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:23.222013 | finish at 2025-09-10 11:45:05 + [2025-09-09 23:40:47] iteration 4191/ 11920 | consumed samples: 4291584 | elapsed time per iteration (ms): 5627.9 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922007E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:57.771016 | finish at 2025-09-10 11:45:45 + [2025-09-09 23:40:53] iteration 4192/ 11920 | consumed samples: 4292608 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913401E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:46.907742 | finish at 2025-09-10 11:46:40 + [2025-09-09 23:40:59] iteration 4193/ 11920 | consumed samples: 4293632 | elapsed time per iteration (ms): 5977.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934394E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:49:48.179050 | finish at 2025-09-10 12:30:47 + [2025-09-09 23:41:05] iteration 4194/ 11920 | consumed samples: 4294656 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928098E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:20.304667 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:41:10] iteration 4195/ 11920 | consumed samples: 4295680 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923213E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:54.410638 
| finish at 2025-09-10 11:45:05 + [2025-09-09 23:41:16] iteration 4196/ 11920 | consumed samples: 4296704 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927751E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:11.358656 | finish at 2025-09-10 11:44:27 + [2025-09-09 23:41:21] iteration 4197/ 11920 | consumed samples: 4297728 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930722E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:58.465922 | finish at 2025-09-10 11:44:20 + [2025-09-09 23:41:27] iteration 4198/ 11920 | consumed samples: 4298752 | elapsed time per iteration (ms): 5851.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922721E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:33:07.424428 | finish at 2025-09-10 12:14:35 + [2025-09-09 23:41:33] iteration 4199/ 11920 | consumed samples: 4299776 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926533E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:32.581186 | finish at 2025-09-10 11:45:05 + [2025-09-09 23:41:38] iteration 4200/ 11920 | consumed samples: 4300800 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.944829E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:36.908760 | finish at 2025-09-10 11:45:15 + [2025-09-09 23:41:44] iteration 4201/ 11920 | consumed samples: 4301824 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929296E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:34.770437 | finish at 2025-09-10 11:45:19 + [2025-09-09 23:41:50] iteration 4202/ 11920 | consumed samples: 4302848 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921922E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:19.143787 | finish at 2025-09-10 11:46:09 + [2025-09-09 23:41:56] iteration 4203/ 11920 | consumed samples: 4303872 | elapsed time per iteration (ms): 5867.5 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939493E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:39.420740 | finish at 2025-09-10 12:16:35 + [2025-09-09 23:42:01] iteration 4204/ 11920 | consumed samples: 4304896 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927201E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:10.632497 | finish at 2025-09-10 11:45:12 + [2025-09-09 23:42:07] iteration 4205/ 11920 | consumed samples: 
4305920 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934695E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:13.082159 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:42:13] iteration 4206/ 11920 | consumed samples: 4306944 | elapsed time per iteration (ms): 5976.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939860E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:48:22.688160 | finish at 2025-09-10 12:30:36 + [2025-09-09 23:42:18] iteration 4207/ 11920 | consumed samples: 4307968 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961370E+00 | loss scale: 1.0 | grad norm: 0.332 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:03.986520 | finish at 2025-09-10 11:45:22 + [2025-09-09 23:42:24] iteration 4208/ 11920 | consumed samples: 4308992 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947264E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:46.316452 | finish at 2025-09-10 11:46:10 + [2025-09-09 23:42:30] iteration 4209/ 11920 | consumed samples: 4310016 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948801E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:26.757382 | finish at 2025-09-10 11:45:56 + [2025-09-09 23:42:36] iteration 4210/ 11920 | consumed samples: 4311040 | elapsed time per iteration (ms): 5858.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950819E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:48.665550 | finish at 2025-09-10 12:15:24 + [2025-09-09 23:42:41] iteration 4211/ 11920 | consumed samples: 4312064 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956328E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:42.318132 | finish at 2025-09-10 11:45:24 + [2025-09-09 23:42:47] iteration 4212/ 11920 | consumed samples: 4313088 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938665E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:27.113148 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:42:52] iteration 4213/ 11920 | consumed samples: 4314112 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916566E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:04:48.634016 | finish at 2025-09-10 11:47:41 + [2025-09-09 23:42:58] iteration 4214/ 11920 | consumed samples: 4315136 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932619E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:10.500252 | finish at 2025-09-10 11:46:09 + [2025-09-09 23:43:04] iteration 4215/ 11920 | consumed samples: 4316160 | elapsed time per iteration (ms): 6008.6 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933945E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:35.941496 | finish at 2025-09-10 12:34:40 + [2025-09-09 23:43:10] iteration 4216/ 11920 | consumed samples: 4317184 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939391E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:05.836321 | finish at 2025-09-10 11:45:16 + [2025-09-09 23:43:15] iteration 4217/ 11920 | consumed samples: 4318208 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918376E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:03:07.817322 | finish at 2025-09-10 11:46:23 + [2025-09-09 23:43:21] iteration 4218/ 11920 | consumed samples: 4319232 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928129E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:17.344128 | finish at 2025-09-10 
11:45:38 + [2025-09-09 23:43:27] iteration 4219/ 11920 | consumed samples: 4320256 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937344E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:43.286604 | finish at 2025-09-10 11:46:10 + [2025-09-09 23:43:32] iteration 4220/ 11920 | consumed samples: 4321280 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941337E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:04.695373 | finish at 2025-09-10 11:45:37 + [2025-09-09 23:43:38] iteration 4221/ 11920 | consumed samples: 4322304 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925550E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:56.880772 | finish at 2025-09-10 11:45:35 + [2025-09-09 23:43:43] iteration 4222/ 11920 | consumed samples: 4323328 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932023E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:26.398378 | finish at 2025-09-10 11:45:10 + [2025-09-09 23:43:50] iteration 4223/ 11920 | consumed samples: 4324352 | elapsed time per iteration (ms): 6057.6 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919784E+00 | loss 
scale: 1.0 | grad norm: 0.120 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:57:05.628136 | finish at 2025-09-10 12:40:55 + [2025-09-09 23:43:55] iteration 4224/ 11920 | consumed samples: 4325376 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924082E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:44.423908 | finish at 2025-09-10 11:45:40 + [2025-09-09 23:44:01] iteration 4225/ 11920 | consumed samples: 4326400 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925117E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:06.006675 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:44:06] iteration 4226/ 11920 | consumed samples: 4327424 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922292E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:21.963856 | finish at 2025-09-10 11:45:28 + [2025-09-09 23:44:12] iteration 4227/ 11920 | consumed samples: 4328448 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933254E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:35.634890 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:44:18] iteration 4228/ 11920 | consumed samples: 4329472 | elapsed time 
per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928413E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:15.965355 | finish at 2025-09-10 11:45:34 + [2025-09-09 23:44:23] iteration 4229/ 11920 | consumed samples: 4330496 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914888E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:49.948763 | finish at 2025-09-10 11:45:13 + [2025-09-09 23:44:29] iteration 4230/ 11920 | consumed samples: 4331520 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932903E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:53.778524 | finish at 2025-09-10 11:45:23 + [2025-09-09 23:44:35] iteration 4231/ 11920 | consumed samples: 4332544 | elapsed time per iteration (ms): 5959.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932038E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:45.754676 | finish at 2025-09-10 12:28:21 + [2025-09-09 23:44:41] iteration 4232/ 11920 | consumed samples: 4333568 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949535E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 12:00:45.647036 | finish at 2025-09-10 11:45:26 + [2025-09-09 23:44:46] iteration 4233/ 11920 | consumed samples: 4334592 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942178E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:07.254686 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:44:52] iteration 4234/ 11920 | consumed samples: 4335616 | elapsed time per iteration (ms): 6132.5 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930804E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:34.099457 | finish at 2025-09-10 12:50:26 + [2025-09-09 23:44:58] iteration 4235/ 11920 | consumed samples: 4336640 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928899E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:05.592029 | finish at 2025-09-10 11:45:03 + [2025-09-09 23:45:04] iteration 4236/ 11920 | consumed samples: 4337664 | elapsed time per iteration (ms): 5953.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936311E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 13.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:27.274996 | finish at 2025-09-10 12:27:31 + [2025-09-09 23:45:09] iteration 4237/ 11920 | consumed samples: 4338688 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.942597E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:46.980515 | finish at 2025-09-10 11:44:56 + [2025-09-09 23:45:15] iteration 4238/ 11920 | consumed samples: 4339712 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920954E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:37.868506 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:45:21] iteration 4239/ 11920 | consumed samples: 4340736 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923135E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:32.987694 | finish at 2025-09-10 11:44:54 + [2025-09-09 23:45:26] iteration 4240/ 11920 | consumed samples: 4341760 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926940E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:23.170166 | finish at 2025-09-10 11:45:50 + [2025-09-09 23:45:32] iteration 4241/ 11920 | consumed samples: 4342784 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939806E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:03.847643 | finish at 2025-09-10 11:45:36 + [2025-09-09 
23:45:38] iteration 4242/ 11920 | consumed samples: 4343808 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947374E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:58.170154 | finish at 2025-09-10 11:45:36 + [2025-09-09 23:45:43] iteration 4243/ 11920 | consumed samples: 4344832 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930965E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:16.614721 | finish at 2025-09-10 11:46:00 + [2025-09-09 23:45:49] iteration 4244/ 11920 | consumed samples: 4345856 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939038E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:35.995519 | finish at 2025-09-10 11:46:25 + [2025-09-09 23:45:54] iteration 4245/ 11920 | consumed samples: 4346880 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926198E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:12.261262 | finish at 2025-09-10 11:47:07 + [2025-09-09 23:46:00] iteration 4246/ 11920 | consumed samples: 4347904 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940578E+00 | loss scale: 1.0 | grad norm: 
0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:29.385976 | finish at 2025-09-10 11:45:29 + [2025-09-09 23:46:06] iteration 4247/ 11920 | consumed samples: 4348928 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943172E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:33.434359 | finish at 2025-09-10 11:45:39 + [2025-09-09 23:46:11] iteration 4248/ 11920 | consumed samples: 4349952 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931093E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:30.634031 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:46:17] iteration 4249/ 11920 | consumed samples: 4350976 | elapsed time per iteration (ms): 5880.0 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931988E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:45.510140 | finish at 2025-09-10 12:18:03 + [2025-09-09 23:46:23] iteration 4250/ 11920 | consumed samples: 4352000 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925811E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:45.865602 | finish at 2025-09-10 11:45:09 + [2025-09-09 23:46:28] iteration 4251/ 11920 | consumed samples: 4353024 | elapsed time per iteration (ms): 
5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915538E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:30.213984 | finish at 2025-09-10 11:44:59 + [2025-09-09 23:46:34] iteration 4252/ 11920 | consumed samples: 4354048 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910093E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:00.863986 | finish at 2025-09-10 11:45:35 + [2025-09-09 23:46:40] iteration 4253/ 11920 | consumed samples: 4355072 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936696E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:56.970797 | finish at 2025-09-10 11:45:37 + [2025-09-09 23:46:45] iteration 4254/ 11920 | consumed samples: 4356096 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927990E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:10.265293 | finish at 2025-09-10 11:46:56 + [2025-09-09 23:46:51] iteration 4255/ 11920 | consumed samples: 4357120 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921385E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 12:00:21.629713 | finish at 2025-09-10 11:47:13 + [2025-09-09 23:46:57] iteration 4256/ 11920 | consumed samples: 4358144 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927382E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:55.684765 | finish at 2025-09-10 11:46:52 + [2025-09-09 23:47:02] iteration 4257/ 11920 | consumed samples: 4359168 | elapsed time per iteration (ms): 5850.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926553E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:10.560798 | finish at 2025-09-10 12:14:13 + [2025-09-09 23:47:08] iteration 4258/ 11920 | consumed samples: 4360192 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939968E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:18.134417 | finish at 2025-09-10 11:46:26 + [2025-09-09 23:47:14] iteration 4259/ 11920 | consumed samples: 4361216 | elapsed time per iteration (ms): 5868.2 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933336E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:16.558320 | finish at 2025-09-10 12:16:31 + [2025-09-09 23:47:20] iteration 4260/ 11920 | consumed samples: 4362240 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.916760E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:27.861266 | finish at 2025-09-10 11:45:47 + [2025-09-09 23:47:25] iteration 4261/ 11920 | consumed samples: 4363264 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923759E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:49.701020 | finish at 2025-09-10 11:46:15 + [2025-09-09 23:47:31] iteration 4262/ 11920 | consumed samples: 4364288 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928228E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:44.183225 | finish at 2025-09-10 11:45:15 + [2025-09-09 23:47:37] iteration 4263/ 11920 | consumed samples: 4365312 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924258E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:03.615764 | finish at 2025-09-10 11:45:40 + [2025-09-09 23:47:42] iteration 4264/ 11920 | consumed samples: 4366336 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924045E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:42.760317 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:47:48] iteration 
4265/ 11920 | consumed samples: 4367360 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948605E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:44.864911 | finish at 2025-09-10 11:44:33 + [2025-09-09 23:47:53] iteration 4266/ 11920 | consumed samples: 4368384 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924751E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:18.635978 | finish at 2025-09-10 11:46:12 + [2025-09-09 23:47:59] iteration 4267/ 11920 | consumed samples: 4369408 | elapsed time per iteration (ms): 5991.0 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930913E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:09.096974 | finish at 2025-09-10 12:32:08 + [2025-09-09 23:48:05] iteration 4268/ 11920 | consumed samples: 4370432 | elapsed time per iteration (ms): 5648.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930252E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:23.896176 | finish at 2025-09-10 11:48:29 + [2025-09-09 23:48:11] iteration 4269/ 11920 | consumed samples: 4371456 | elapsed time per iteration (ms): 5975.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923609E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 
4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:41:55.459399 | finish at 2025-09-10 12:30:06 + [2025-09-09 23:48:17] iteration 4270/ 11920 | consumed samples: 4372480 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929846E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:35.929220 | finish at 2025-09-10 11:45:53 + [2025-09-09 23:48:22] iteration 4271/ 11920 | consumed samples: 4373504 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928792E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:02.603189 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:48:28] iteration 4272/ 11920 | consumed samples: 4374528 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914930E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:34.535675 | finish at 2025-09-10 11:46:02 + [2025-09-09 23:48:34] iteration 4273/ 11920 | consumed samples: 4375552 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923350E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:59.272080 | finish at 2025-09-10 11:45:33 + [2025-09-09 23:48:39] iteration 4274/ 11920 | consumed samples: 4376576 | elapsed time per iteration (ms): 5629.9 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920552E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:26.162371 | finish at 2025-09-10 11:46:05 + [2025-09-09 23:48:45] iteration 4275/ 11920 | consumed samples: 4377600 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920270E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:08.364066 | finish at 2025-09-10 11:45:53 + [2025-09-09 23:48:50] iteration 4276/ 11920 | consumed samples: 4378624 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924050E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:58:15.204526 | finish at 2025-09-10 11:47:06 + [2025-09-09 23:48:56] iteration 4277/ 11920 | consumed samples: 4379648 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929268E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:49.640624 | finish at 2025-09-10 11:46:46 + [2025-09-09 23:49:02] iteration 4278/ 11920 | consumed samples: 4380672 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929609E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:02.010181 
| finish at 2025-09-10 11:45:04 + [2025-09-09 23:49:07] iteration 4279/ 11920 | consumed samples: 4381696 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922719E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:37.784123 | finish at 2025-09-10 11:45:45 + [2025-09-09 23:49:13] iteration 4280/ 11920 | consumed samples: 4382720 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928883E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:51.313782 | finish at 2025-09-10 11:46:04 + [2025-09-09 23:49:19] iteration 4281/ 11920 | consumed samples: 4383744 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918613E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:52.184175 | finish at 2025-09-10 11:46:11 + [2025-09-09 23:49:24] iteration 4282/ 11920 | consumed samples: 4384768 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920693E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:45.617218 | finish at 2025-09-10 11:47:10 + [2025-09-09 23:49:30] iteration 4283/ 11920 | consumed samples: 4385792 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.930937E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:41.691346 | finish at 2025-09-10 11:46:12 + [2025-09-09 23:49:35] iteration 4284/ 11920 | consumed samples: 4386816 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920067E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:30.884777 | finish at 2025-09-10 11:46:06 + [2025-09-09 23:49:41] iteration 4285/ 11920 | consumed samples: 4387840 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936775E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:34.181628 | finish at 2025-09-10 11:46:15 + [2025-09-09 23:49:47] iteration 4286/ 11920 | consumed samples: 4388864 | elapsed time per iteration (ms): 5870.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924229E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:51.511783 | finish at 2025-09-10 12:16:38 + [2025-09-09 23:49:53] iteration 4287/ 11920 | consumed samples: 4389888 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925546E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:17.680313 | finish at 2025-09-10 11:47:10 + [2025-09-09 23:49:58] iteration 4288/ 11920 | consumed samples: 
4390912 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922918E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:55:36.836277 | finish at 2025-09-10 11:45:35 + [2025-09-09 23:50:04] iteration 4289/ 11920 | consumed samples: 4391936 | elapsed time per iteration (ms): 5839.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930138E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:42.668703 | finish at 2025-09-10 12:12:47 + [2025-09-09 23:50:10] iteration 4290/ 11920 | consumed samples: 4392960 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928117E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:55:43.894067 | finish at 2025-09-10 11:45:54 + [2025-09-09 23:50:15] iteration 4291/ 11920 | consumed samples: 4393984 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930615E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:57.406106 | finish at 2025-09-10 11:45:13 + [2025-09-09 23:50:21] iteration 4292/ 11920 | consumed samples: 4395008 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929788E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:51.279399 | finish at 2025-09-10 11:45:12 + [2025-09-09 23:50:27] iteration 4293/ 11920 | consumed samples: 4396032 | elapsed time per iteration (ms): 6068.4 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932304E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:51:23.826102 | finish at 2025-09-10 12:41:51 + [2025-09-09 23:50:33] iteration 4294/ 11920 | consumed samples: 4397056 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935666E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:24.004462 | finish at 2025-09-10 11:46:57 + [2025-09-09 23:50:38] iteration 4295/ 11920 | consumed samples: 4398080 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935616E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:57.893131 | finish at 2025-09-10 11:45:36 + [2025-09-09 23:50:44] iteration 4296/ 11920 | consumed samples: 4399104 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926142E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:55:30.549826 | finish at 2025-09-10 11:46:14 + [2025-09-09 23:50:50] iteration 4297/ 11920 | consumed samples: 4400128 | elapsed time per iteration (ms): 5823.9 | throughput per GPU (TFLOP/s/GPU): 77.5 
| MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928392E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:55.451453 | finish at 2025-09-10 12:10:45 + [2025-09-09 23:50:55] iteration 4298/ 11920 | consumed samples: 4401152 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918492E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:23.504478 | finish at 2025-09-10 11:45:19 + [2025-09-09 23:51:01] iteration 4299/ 11920 | consumed samples: 4402176 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937850E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:56.466379 | finish at 2025-09-10 11:45:57 + [2025-09-09 23:51:07] iteration 4300/ 11920 | consumed samples: 4403200 | elapsed time per iteration (ms): 5955.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912190E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:19.588366 | finish at 2025-09-10 12:27:27 + [2025-09-09 23:51:13] iteration 4301/ 11920 | consumed samples: 4404224 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932101E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:26.815586 | finish at 2025-09-10 
11:44:39 + [2025-09-09 23:51:18] iteration 4302/ 11920 | consumed samples: 4405248 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926093E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:11.938367 | finish at 2025-09-10 11:45:30 + [2025-09-09 23:51:24] iteration 4303/ 11920 | consumed samples: 4406272 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926985E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:46.712819 | finish at 2025-09-10 11:45:11 + [2025-09-09 23:51:30] iteration 4304/ 11920 | consumed samples: 4407296 | elapsed time per iteration (ms): 5840.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940545E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:21.582993 | finish at 2025-09-10 12:12:51 + [2025-09-09 23:51:35] iteration 4305/ 11920 | consumed samples: 4408320 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933824E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:26.521261 | finish at 2025-09-10 11:46:02 + [2025-09-09 23:51:41] iteration 4306/ 11920 | consumed samples: 4409344 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927957E+00 | loss 
scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:48.497671 | finish at 2025-09-10 11:45:29 + [2025-09-09 23:51:47] iteration 4307/ 11920 | consumed samples: 4410368 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922307E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:54.183166 | finish at 2025-09-10 11:44:41 + [2025-09-09 23:51:52] iteration 4308/ 11920 | consumed samples: 4411392 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930785E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:01.684089 | finish at 2025-09-10 11:44:54 + [2025-09-09 23:51:58] iteration 4309/ 11920 | consumed samples: 4412416 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931414E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:44.557391 | finish at 2025-09-10 11:44:42 + [2025-09-09 23:52:03] iteration 4310/ 11920 | consumed samples: 4413440 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937304E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:12.293150 | finish at 2025-09-10 11:46:16 + [2025-09-09 23:52:09] iteration 4311/ 11920 | consumed samples: 4414464 | elapsed time 
per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930633E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:36.123087 | finish at 2025-09-10 11:45:45 + [2025-09-09 23:52:15] iteration 4312/ 11920 | consumed samples: 4415488 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929395E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:58.246731 | finish at 2025-09-10 11:46:13 + [2025-09-09 23:52:20] iteration 4313/ 11920 | consumed samples: 4416512 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926556E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:55.968498 | finish at 2025-09-10 11:45:16 + [2025-09-09 23:52:26] iteration 4314/ 11920 | consumed samples: 4417536 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923661E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:40.683756 | finish at 2025-09-10 11:46:07 + [2025-09-09 23:52:32] iteration 4315/ 11920 | consumed samples: 4418560 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924479E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:54:07.484318 | finish at 2025-09-10 11:46:39 + [2025-09-09 23:52:37] iteration 4316/ 11920 | consumed samples: 4419584 | elapsed time per iteration (ms): 5830.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919464E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:18:55.159216 | finish at 2025-09-10 12:11:33 + [2025-09-09 23:52:43] iteration 4317/ 11920 | consumed samples: 4420608 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922982E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:52.621500 | finish at 2025-09-10 11:46:36 + [2025-09-09 23:52:49] iteration 4318/ 11920 | consumed samples: 4421632 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935668E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:04.196108 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:52:54] iteration 4319/ 11920 | consumed samples: 4422656 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911206E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:58.084871 | finish at 2025-09-10 11:44:52 + [2025-09-09 23:53:00] iteration 4320/ 11920 | consumed samples: 4423680 | elapsed time per iteration (ms): 5861.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.919449E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:24.133186 | finish at 2025-09-10 12:15:24 + [2025-09-09 23:53:06] iteration 4321/ 11920 | consumed samples: 4424704 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933619E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:01.295992 | finish at 2025-09-10 11:47:07 + [2025-09-09 23:53:11] iteration 4322/ 11920 | consumed samples: 4425728 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925992E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:56.044604 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:53:17] iteration 4323/ 11920 | consumed samples: 4426752 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921476E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:51.634330 | finish at 2025-09-10 11:45:09 + [2025-09-09 23:53:23] iteration 4324/ 11920 | consumed samples: 4427776 | elapsed time per iteration (ms): 5915.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910656E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:57.183583 | finish at 2025-09-10 12:22:20 + [2025-09-09 
23:53:29] iteration 4325/ 11920 | consumed samples: 4428800 | elapsed time per iteration (ms): 5842.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913220E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:35.512965 | finish at 2025-09-10 12:13:04 + [2025-09-09 23:53:35] iteration 4326/ 11920 | consumed samples: 4429824 | elapsed time per iteration (ms): 6244.3 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931845E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:10:19.360009 | finish at 2025-09-10 13:03:54 + [2025-09-09 23:53:41] iteration 4327/ 11920 | consumed samples: 4430848 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932082E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:31.824087 | finish at 2025-09-10 11:46:12 + [2025-09-09 23:53:46] iteration 4328/ 11920 | consumed samples: 4431872 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915952E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:09.475491 | finish at 2025-09-10 11:44:56 + [2025-09-09 23:53:52] iteration 4329/ 11920 | consumed samples: 4432896 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912916E+00 | loss scale: 1.0 | grad norm: 
0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:37.150712 | finish at 2025-09-10 11:45:29 + [2025-09-09 23:53:57] iteration 4330/ 11920 | consumed samples: 4433920 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907979E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:09.573784 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:54:03] iteration 4331/ 11920 | consumed samples: 4434944 | elapsed time per iteration (ms): 6004.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930833E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:39:28.772254 | finish at 2025-09-10 12:33:32 + [2025-09-09 23:54:09] iteration 4332/ 11920 | consumed samples: 4435968 | elapsed time per iteration (ms): 5937.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917243E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:53.402649 | finish at 2025-09-10 12:25:03 + [2025-09-09 23:54:15] iteration 4333/ 11920 | consumed samples: 4436992 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921985E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:10.729028 | finish at 2025-09-10 11:46:26 + [2025-09-09 23:54:21] iteration 4334/ 11920 | consumed samples: 4438016 | elapsed time per iteration (ms): 
5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907332E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:08.323615 | finish at 2025-09-10 11:45:29 + [2025-09-09 23:54:26] iteration 4335/ 11920 | consumed samples: 4439040 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932544E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:59.688332 | finish at 2025-09-10 11:47:26 + [2025-09-09 23:54:32] iteration 4336/ 11920 | consumed samples: 4440064 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922232E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:40.590111 | finish at 2025-09-10 11:47:13 + [2025-09-09 23:54:38] iteration 4337/ 11920 | consumed samples: 4441088 | elapsed time per iteration (ms): 5919.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930658E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:28:10.266782 | finish at 2025-09-10 12:22:48 + [2025-09-09 23:54:44] iteration 4338/ 11920 | consumed samples: 4442112 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936607E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:51:23.079834 | finish at 2025-09-10 11:46:07 + [2025-09-09 23:54:49] iteration 4339/ 11920 | consumed samples: 4443136 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940320E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:00.418694 | finish at 2025-09-10 11:45:50 + [2025-09-09 23:54:55] iteration 4340/ 11920 | consumed samples: 4444160 | elapsed time per iteration (ms): 5834.0 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904592E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:01.382890 | finish at 2025-09-10 12:11:56 + [2025-09-09 23:55:01] iteration 4341/ 11920 | consumed samples: 4445184 | elapsed time per iteration (ms): 5992.1 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917650E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:36:54.055431 | finish at 2025-09-10 12:31:55 + [2025-09-09 23:55:07] iteration 4342/ 11920 | consumed samples: 4446208 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930094E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:16.110389 | finish at 2025-09-10 11:47:23 + [2025-09-09 23:55:13] iteration 4343/ 11920 | consumed samples: 4447232 | elapsed time per iteration (ms): 5951.4 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.918742E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:31:34.074252 | finish at 2025-09-10 12:26:47 + [2025-09-09 23:55:18] iteration 4344/ 11920 | consumed samples: 4448256 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927721E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:39.302900 | finish at 2025-09-10 11:44:57 + [2025-09-09 23:55:24] iteration 4345/ 11920 | consumed samples: 4449280 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930402E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:56.460146 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:55:30] iteration 4346/ 11920 | consumed samples: 4450304 | elapsed time per iteration (ms): 6133.3 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919208E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:54:13.726658 | finish at 2025-09-10 12:49:44 + [2025-09-09 23:55:36] iteration 4347/ 11920 | consumed samples: 4451328 | elapsed time per iteration (ms): 5958.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931811E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:32:04.191054 | finish at 2025-09-10 12:27:40 + [2025-09-09 23:55:42] iteration 
4348/ 11920 | consumed samples: 4452352 | elapsed time per iteration (ms): 5859.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917994E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:19:27.140562 | finish at 2025-09-10 12:15:09 + [2025-09-09 23:55:47] iteration 4349/ 11920 | consumed samples: 4453376 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930936E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:39.057255 | finish at 2025-09-10 11:45:26 + [2025-09-09 23:55:53] iteration 4350/ 11920 | consumed samples: 4454400 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918962E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:47.680604 | finish at 2025-09-10 11:45:41 + [2025-09-09 23:55:59] iteration 4351/ 11920 | consumed samples: 4455424 | elapsed time per iteration (ms): 5878.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914405E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:32.074555 | finish at 2025-09-10 12:17:31 + [2025-09-09 23:56:05] iteration 4352/ 11920 | consumed samples: 4456448 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917098E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 
7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:50:13.230465 | finish at 2025-09-10 11:46:18 + [2025-09-09 23:56:10] iteration 4353/ 11920 | consumed samples: 4457472 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941062E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:03.781046 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:56:16] iteration 4354/ 11920 | consumed samples: 4458496 | elapsed time per iteration (ms): 5953.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951685E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:42.696656 | finish at 2025-09-10 12:26:59 + [2025-09-09 23:56:22] iteration 4355/ 11920 | consumed samples: 4459520 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931424E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:17.738702 | finish at 2025-09-10 11:45:39 + [2025-09-09 23:56:27] iteration 4356/ 11920 | consumed samples: 4460544 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912133E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:46.643701 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:56:33] iteration 4357/ 11920 | consumed samples: 4461568 | elapsed time per iteration (ms): 5633.3 | throughput per 
GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923088E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:50:05.025259 | finish at 2025-09-10 11:46:38 + [2025-09-09 23:56:39] iteration 4358/ 11920 | consumed samples: 4462592 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924433E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:21.943431 | finish at 2025-09-10 11:46:01 + [2025-09-09 23:56:44] iteration 4359/ 11920 | consumed samples: 4463616 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921972E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:09.839797 | finish at 2025-09-10 11:45:54 + [2025-09-09 23:56:50] iteration 4360/ 11920 | consumed samples: 4464640 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931405E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:55.975084 | finish at 2025-09-10 11:45:46 + [2025-09-09 23:56:56] iteration 4361/ 11920 | consumed samples: 4465664 | elapsed time per iteration (ms): 5885.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917356E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:21:27.711997 
| finish at 2025-09-10 12:18:23 + [2025-09-09 23:57:01] iteration 4362/ 11920 | consumed samples: 4466688 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926221E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:30.427172 | finish at 2025-09-10 11:45:32 + [2025-09-09 23:57:07] iteration 4363/ 11920 | consumed samples: 4467712 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928692E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:37.152849 | finish at 2025-09-10 11:46:44 + [2025-09-09 23:57:13] iteration 4364/ 11920 | consumed samples: 4468736 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910052E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:01.728811 | finish at 2025-09-10 11:45:14 + [2025-09-09 23:57:18] iteration 4365/ 11920 | consumed samples: 4469760 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940683E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:27.453755 | finish at 2025-09-10 11:45:46 + [2025-09-09 23:57:24] iteration 4366/ 11920 | consumed samples: 4470784 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.936391E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:15.196022 | finish at 2025-09-10 11:45:39 + [2025-09-09 23:57:29] iteration 4367/ 11920 | consumed samples: 4471808 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925676E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:26.245681 | finish at 2025-09-10 11:45:56 + [2025-09-09 23:57:35] iteration 4368/ 11920 | consumed samples: 4472832 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918116E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:44.282410 | finish at 2025-09-10 11:46:19 + [2025-09-09 23:57:41] iteration 4369/ 11920 | consumed samples: 4473856 | elapsed time per iteration (ms): 5854.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926105E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:16:49.520881 | finish at 2025-09-10 12:14:30 + [2025-09-09 23:57:47] iteration 4370/ 11920 | consumed samples: 4474880 | elapsed time per iteration (ms): 5846.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922176E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:44.053495 | finish at 2025-09-10 12:13:31 + [2025-09-09 23:57:52] iteration 4371/ 11920 | consumed samples: 
4475904 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930602E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:12.805201 | finish at 2025-09-10 11:45:05 + [2025-09-09 23:57:58] iteration 4372/ 11920 | consumed samples: 4476928 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924484E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:32.162436 | finish at 2025-09-10 11:45:30 + [2025-09-09 23:58:04] iteration 4373/ 11920 | consumed samples: 4477952 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922113E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:23.885908 | finish at 2025-09-10 11:45:28 + [2025-09-09 23:58:10] iteration 4374/ 11920 | consumed samples: 4478976 | elapsed time per iteration (ms): 6060.0 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918403E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:42:08.703796 | finish at 2025-09-10 12:40:18 + [2025-09-09 23:58:15] iteration 4375/ 11920 | consumed samples: 4480000 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921803E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:11.139568 | finish at 2025-09-10 11:45:27 + [2025-09-09 23:58:21] iteration 4376/ 11920 | consumed samples: 4481024 | elapsed time per iteration (ms): 5939.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930078E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:26:49.986279 | finish at 2025-09-10 12:25:11 + [2025-09-09 23:58:27] iteration 4377/ 11920 | consumed samples: 4482048 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922211E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:52.777646 | finish at 2025-09-10 11:45:20 + [2025-09-09 23:58:33] iteration 4378/ 11920 | consumed samples: 4483072 | elapsed time per iteration (ms): 6000.9 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927548E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:34:18.493130 | finish at 2025-09-10 12:32:51 + [2025-09-09 23:58:39] iteration 4379/ 11920 | consumed samples: 4484096 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920995E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:58.351537 | finish at 2025-09-10 11:45:37 + [2025-09-09 23:58:44] iteration 4380/ 11920 | consumed samples: 4485120 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929037E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:10.741782 | finish at 2025-09-10 11:44:55 + [2025-09-09 23:58:50] iteration 4381/ 11920 | consumed samples: 4486144 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920283E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:53.854450 | finish at 2025-09-10 11:45:44 + [2025-09-09 23:58:55] iteration 4382/ 11920 | consumed samples: 4487168 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921126E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:58.057910 | finish at 2025-09-10 11:44:53 + [2025-09-09 23:59:01] iteration 4383/ 11920 | consumed samples: 4488192 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917243E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:46.647036 | finish at 2025-09-10 11:44:48 + [2025-09-09 23:59:07] iteration 4384/ 11920 | consumed samples: 4489216 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922938E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:52.811188 | finish at 2025-09-10 
11:45:59 + [2025-09-09 23:59:12] iteration 4385/ 11920 | consumed samples: 4490240 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923325E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:12.237954 | finish at 2025-09-10 11:45:25 + [2025-09-09 23:59:18] iteration 4386/ 11920 | consumed samples: 4491264 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921756E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:49.460420 | finish at 2025-09-10 11:45:07 + [2025-09-09 23:59:24] iteration 4387/ 11920 | consumed samples: 4492288 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932473E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:37.159168 | finish at 2025-09-10 11:46:01 + [2025-09-09 23:59:29] iteration 4388/ 11920 | consumed samples: 4493312 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925159E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:01.882835 | finish at 2025-09-10 11:45:31 + [2025-09-09 23:59:35] iteration 4389/ 11920 | consumed samples: 4494336 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935847E+00 | loss 
scale: 1.0 | grad norm: 0.273 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:11.868922 | finish at 2025-09-10 11:45:47 + [2025-09-09 23:59:40] iteration 4390/ 11920 | consumed samples: 4495360 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930448E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:01.170895 | finish at 2025-09-10 11:45:42 + [2025-09-09 23:59:46] iteration 4391/ 11920 | consumed samples: 4496384 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932861E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:46.731192 | finish at 2025-09-10 11:46:33 + [2025-09-09 23:59:52] iteration 4392/ 11920 | consumed samples: 4497408 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927384E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:23.633394 | finish at 2025-09-10 11:46:15 + [2025-09-09 23:59:57] iteration 4393/ 11920 | consumed samples: 4498432 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935407E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:03.237610 | finish at 2025-09-10 11:45:01 + [2025-09-10 00:00:03] iteration 4394/ 11920 | consumed samples: 4499456 | elapsed time 
per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953962E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:50.823128 | finish at 2025-09-10 11:45:54 + [2025-09-10 00:00:09] iteration 4395/ 11920 | consumed samples: 4500480 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945452E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:55.768490 | finish at 2025-09-10 11:46:04 + [2025-09-10 00:00:14] iteration 4396/ 11920 | consumed samples: 4501504 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935978E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:14.800747 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:00:20] iteration 4397/ 11920 | consumed samples: 4502528 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935174E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:03.116118 | finish at 2025-09-10 11:45:23 + [2025-09-10 00:00:25] iteration 4398/ 11920 | consumed samples: 4503552 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941826E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:44:54.139318 | finish at 2025-09-10 11:45:20 + [2025-09-10 00:00:31] iteration 4399/ 11920 | consumed samples: 4504576 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932986E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:14.319966 | finish at 2025-09-10 11:45:45 + [2025-09-10 00:00:37] iteration 4400/ 11920 | consumed samples: 4505600 | elapsed time per iteration (ms): 5842.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926861E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:17.494888 | finish at 2025-09-10 12:12:54 + [2025-09-10 00:00:43] iteration 4401/ 11920 | consumed samples: 4506624 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932404E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:07.086818 | finish at 2025-09-10 11:45:50 + [2025-09-10 00:00:48] iteration 4402/ 11920 | consumed samples: 4507648 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928385E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:13.865710 | finish at 2025-09-10 11:45:02 + [2025-09-10 00:00:54] iteration 4403/ 11920 | consumed samples: 4508672 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.925676E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:35.388105 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:00:59] iteration 4404/ 11920 | consumed samples: 4509696 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932741E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:32.688603 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:01:05] iteration 4405/ 11920 | consumed samples: 4510720 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918060E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:15.451316 | finish at 2025-09-10 11:46:20 + [2025-09-10 00:01:11] iteration 4406/ 11920 | consumed samples: 4511744 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923907E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:33.428449 | finish at 2025-09-10 11:45:44 + [2025-09-10 00:01:16] iteration 4407/ 11920 | consumed samples: 4512768 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920521E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:41.176548 | finish at 2025-09-10 11:44:57 + [2025-09-10 
00:01:22] iteration 4408/ 11920 | consumed samples: 4513792 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918719E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:28.262358 | finish at 2025-09-10 11:45:50 + [2025-09-10 00:01:28] iteration 4409/ 11920 | consumed samples: 4514816 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923151E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:54.092638 | finish at 2025-09-10 11:45:22 + [2025-09-10 00:01:33] iteration 4410/ 11920 | consumed samples: 4515840 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919554E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:13.526258 | finish at 2025-09-10 11:45:47 + [2025-09-10 00:01:39] iteration 4411/ 11920 | consumed samples: 4516864 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915854E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:58.169757 | finish at 2025-09-10 11:45:37 + [2025-09-10 00:01:44] iteration 4412/ 11920 | consumed samples: 4517888 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922792E+00 | loss scale: 1.0 | grad norm: 
0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:44.618424 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:01:50] iteration 4413/ 11920 | consumed samples: 4518912 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925419E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:23.324703 | finish at 2025-09-10 11:45:13 + [2025-09-10 00:01:56] iteration 4414/ 11920 | consumed samples: 4519936 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928864E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:08.935737 | finish at 2025-09-10 11:45:05 + [2025-09-10 00:02:01] iteration 4415/ 11920 | consumed samples: 4520960 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911196E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:12.504910 | finish at 2025-09-10 11:46:14 + [2025-09-10 00:02:07] iteration 4416/ 11920 | consumed samples: 4521984 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907456E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:24.942070 | finish at 2025-09-10 11:47:32 + [2025-09-10 00:02:13] iteration 4417/ 11920 | consumed samples: 4523008 | elapsed time per iteration (ms): 
5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917786E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:18.439580 | finish at 2025-09-10 11:45:31 + [2025-09-10 00:02:18] iteration 4418/ 11920 | consumed samples: 4524032 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915068E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:09.740740 | finish at 2025-09-10 11:45:28 + [2025-09-10 00:02:24] iteration 4419/ 11920 | consumed samples: 4525056 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909235E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:27.753856 | finish at 2025-09-10 11:44:52 + [2025-09-10 00:02:29] iteration 4420/ 11920 | consumed samples: 4526080 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922080E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:33.860750 | finish at 2025-09-10 11:46:03 + [2025-09-10 00:02:35] iteration 4421/ 11920 | consumed samples: 4527104 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923862E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:42:45.860771 | finish at 2025-09-10 11:45:21 + [2025-09-10 00:02:41] iteration 4422/ 11920 | consumed samples: 4528128 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922070E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:47.998150 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:02:46] iteration 4423/ 11920 | consumed samples: 4529152 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924786E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:28.157331 | finish at 2025-09-10 11:46:14 + [2025-09-10 00:02:52] iteration 4424/ 11920 | consumed samples: 4530176 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929251E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:17.868734 | finish at 2025-09-10 11:45:10 + [2025-09-10 00:02:58] iteration 4425/ 11920 | consumed samples: 4531200 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919135E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:15.755459 | finish at 2025-09-10 11:46:13 + [2025-09-10 00:03:03] iteration 4426/ 11920 | consumed samples: 4532224 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.924366E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:39.335252 | finish at 2025-09-10 11:45:43 + [2025-09-10 00:03:09] iteration 4427/ 11920 | consumed samples: 4533248 | elapsed time per iteration (ms): 5853.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917544E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:11:02.897210 | finish at 2025-09-10 12:14:12 + [2025-09-10 00:03:15] iteration 4428/ 11920 | consumed samples: 4534272 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924839E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:35.284062 | finish at 2025-09-10 11:45:50 + [2025-09-10 00:03:20] iteration 4429/ 11920 | consumed samples: 4535296 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912552E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:41:56.737922 | finish at 2025-09-10 11:45:17 + [2025-09-10 00:03:26] iteration 4430/ 11920 | consumed samples: 4536320 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917737E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:41:35.365250 | finish at 2025-09-10 11:45:01 + [2025-09-10 00:03:32] iteration 
4431/ 11920 | consumed samples: 4537344 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908063E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:10.192352 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:03:37] iteration 4432/ 11920 | consumed samples: 4538368 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911528E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:41.773727 | finish at 2025-09-10 11:46:19 + [2025-09-10 00:03:43] iteration 4433/ 11920 | consumed samples: 4539392 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935662E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:55.509047 | finish at 2025-09-10 11:46:38 + [2025-09-10 00:03:48] iteration 4434/ 11920 | consumed samples: 4540416 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907228E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:41:59.278544 | finish at 2025-09-10 11:45:48 + [2025-09-10 00:03:54] iteration 4435/ 11920 | consumed samples: 4541440 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931873E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:21.143332 | finish at 2025-09-10 11:46:15 + [2025-09-10 00:04:00] iteration 4436/ 11920 | consumed samples: 4542464 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919736E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:19.592218 | finish at 2025-09-10 11:46:19 + [2025-09-10 00:04:05] iteration 4437/ 11920 | consumed samples: 4543488 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913510E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:36.284685 | finish at 2025-09-10 11:47:42 + [2025-09-10 00:04:11] iteration 4438/ 11920 | consumed samples: 4544512 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902354E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:22.676674 | finish at 2025-09-10 11:46:34 + [2025-09-10 00:04:17] iteration 4439/ 11920 | consumed samples: 4545536 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903167E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:46.789930 | finish at 2025-09-10 11:45:03 + [2025-09-10 00:04:22] iteration 4440/ 11920 | consumed samples: 4546560 | elapsed time per iteration (ms): 5918.2 | throughput per 
GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925247E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:47.873964 | finish at 2025-09-10 12:22:10 + [2025-09-10 00:04:28] iteration 4441/ 11920 | consumed samples: 4547584 | elapsed time per iteration (ms): 5825.1 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911409E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:06.140031 | finish at 2025-09-10 12:10:34 + [2025-09-10 00:04:34] iteration 4442/ 11920 | consumed samples: 4548608 | elapsed time per iteration (ms): 5850.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923279E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:10.918766 | finish at 2025-09-10 12:13:45 + [2025-09-10 00:04:40] iteration 4443/ 11920 | consumed samples: 4549632 | elapsed time per iteration (ms): 6186.4 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918828E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:50:55.421754 | finish at 2025-09-10 12:55:36 + [2025-09-10 00:04:46] iteration 4444/ 11920 | consumed samples: 4550656 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910622E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:41:46.963543 
| finish at 2025-09-10 11:46:33 + [2025-09-10 00:04:52] iteration 4445/ 11920 | consumed samples: 4551680 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924301E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:44.294405 | finish at 2025-09-10 11:45:36 + [2025-09-10 00:04:57] iteration 4446/ 11920 | consumed samples: 4552704 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906422E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:05.746618 | finish at 2025-09-10 11:45:03 + [2025-09-10 00:05:03] iteration 4447/ 11920 | consumed samples: 4553728 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924829E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:01.897380 | finish at 2025-09-10 11:45:05 + [2025-09-10 00:05:09] iteration 4448/ 11920 | consumed samples: 4554752 | elapsed time per iteration (ms): 6146.3 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912904E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:45:24.813919 | finish at 2025-09-10 12:50:34 + [2025-09-10 00:05:15] iteration 4449/ 11920 | consumed samples: 4555776 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.916635E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:19.631026 | finish at 2025-09-10 11:44:34 + [2025-09-10 00:05:20] iteration 4450/ 11920 | consumed samples: 4556800 | elapsed time per iteration (ms): 5827.5 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918461E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:31.388383 | finish at 2025-09-10 12:10:52 + [2025-09-10 00:05:26] iteration 4451/ 11920 | consumed samples: 4557824 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922540E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:39.401187 | finish at 2025-09-10 11:45:05 + [2025-09-10 00:05:32] iteration 4452/ 11920 | consumed samples: 4558848 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917066E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:37.770825 | finish at 2025-09-10 11:45:09 + [2025-09-10 00:05:38] iteration 4453/ 11920 | consumed samples: 4559872 | elapsed time per iteration (ms): 6310.8 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923786E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:05:23.000523 | finish at 2025-09-10 13:11:01 + [2025-09-10 00:05:44] iteration 4454/ 11920 | consumed samples: 
4560896 | elapsed time per iteration (ms): 5929.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922929E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:46.185605 | finish at 2025-09-10 12:23:30 + [2025-09-10 00:05:50] iteration 4455/ 11920 | consumed samples: 4561920 | elapsed time per iteration (ms): 5863.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911130E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:33.642901 | finish at 2025-09-10 12:15:23 + [2025-09-10 00:05:56] iteration 4456/ 11920 | consumed samples: 4562944 | elapsed time per iteration (ms): 5831.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915531E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:05:27.966047 | finish at 2025-09-10 12:11:24 + [2025-09-10 00:06:01] iteration 4457/ 11920 | consumed samples: 4563968 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911044E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:49.607480 | finish at 2025-09-10 11:44:51 + [2025-09-10 00:06:07] iteration 4458/ 11920 | consumed samples: 4564992 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918306E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 8.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:23.407665 | finish at 2025-09-10 11:46:30 + [2025-09-10 00:06:13] iteration 4459/ 11920 | consumed samples: 4566016 | elapsed time per iteration (ms): 5838.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918428E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:03.249097 | finish at 2025-09-10 12:12:16 + [2025-09-10 00:06:19] iteration 4460/ 11920 | consumed samples: 4567040 | elapsed time per iteration (ms): 6142.6 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919244E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:43:43.928061 | finish at 2025-09-10 12:50:03 + [2025-09-10 00:06:24] iteration 4461/ 11920 | consumed samples: 4568064 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932780E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:05.683754 | finish at 2025-09-10 11:45:30 + [2025-09-10 00:06:30] iteration 4462/ 11920 | consumed samples: 4569088 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913037E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:02.586971 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:06:36] iteration 4463/ 11920 | consumed samples: 4570112 | elapsed time per iteration (ms): 5867.0 | throughput per GPU (TFLOP/s/GPU): 77.0 
| MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925177E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:10.253548 | finish at 2025-09-10 12:15:46 + [2025-09-10 00:06:42] iteration 4464/ 11920 | consumed samples: 4571136 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910663E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:43.803047 | finish at 2025-09-10 11:46:25 + [2025-09-10 00:06:47] iteration 4465/ 11920 | consumed samples: 4572160 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929058E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:27.736995 | finish at 2025-09-10 11:46:15 + [2025-09-10 00:06:53] iteration 4466/ 11920 | consumed samples: 4573184 | elapsed time per iteration (ms): 6176.3 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901742E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:47:18.366265 | finish at 2025-09-10 12:54:12 + [2025-09-10 00:06:59] iteration 4467/ 11920 | consumed samples: 4574208 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919253E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:47.954706 | finish at 2025-09-10 
11:45:47 + [2025-09-10 00:07:05] iteration 4468/ 11920 | consumed samples: 4575232 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929152E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:15.754025 | finish at 2025-09-10 11:46:20 + [2025-09-10 00:07:10] iteration 4469/ 11920 | consumed samples: 4576256 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927599E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:57.398961 | finish at 2025-09-10 11:47:08 + [2025-09-10 00:07:16] iteration 4470/ 11920 | consumed samples: 4577280 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920690E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:35.433707 | finish at 2025-09-10 11:46:51 + [2025-09-10 00:07:22] iteration 4471/ 11920 | consumed samples: 4578304 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936266E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:05.220150 | finish at 2025-09-10 11:45:27 + [2025-09-10 00:07:27] iteration 4472/ 11920 | consumed samples: 4579328 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917965E+00 | loss 
scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:05.255262 | finish at 2025-09-10 11:46:32 + [2025-09-10 00:07:33] iteration 4473/ 11920 | consumed samples: 4580352 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919424E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:09.774488 | finish at 2025-09-10 11:45:43 + [2025-09-10 00:07:38] iteration 4474/ 11920 | consumed samples: 4581376 | elapsed time per iteration (ms): 5647.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916771E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:47.218153 | finish at 2025-09-10 11:48:26 + [2025-09-10 00:07:44] iteration 4475/ 11920 | consumed samples: 4582400 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931154E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:36.858504 | finish at 2025-09-10 11:48:21 + [2025-09-10 00:07:50] iteration 4476/ 11920 | consumed samples: 4583424 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935616E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:52.564402 | finish at 2025-09-10 11:47:42 + [2025-09-10 00:07:55] iteration 4477/ 11920 | consumed samples: 4584448 | elapsed time 
per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917773E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:36.092130 | finish at 2025-09-10 11:46:31 + [2025-09-10 00:08:01] iteration 4478/ 11920 | consumed samples: 4585472 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918961E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:06.400859 | finish at 2025-09-10 11:46:07 + [2025-09-10 00:08:07] iteration 4479/ 11920 | consumed samples: 4586496 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914998E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:28.355761 | finish at 2025-09-10 11:46:35 + [2025-09-10 00:08:12] iteration 4480/ 11920 | consumed samples: 4587520 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917938E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:20.180054 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:08:18] iteration 4481/ 11920 | consumed samples: 4588544 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921540E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:37:38.327878 | finish at 2025-09-10 11:45:56 + [2025-09-10 00:08:24] iteration 4482/ 11920 | consumed samples: 4589568 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935632E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:04.045321 | finish at 2025-09-10 11:45:28 + [2025-09-10 00:08:29] iteration 4483/ 11920 | consumed samples: 4590592 | elapsed time per iteration (ms): 5859.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918026E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:13.576029 | finish at 2025-09-10 12:14:43 + [2025-09-10 00:08:35] iteration 4484/ 11920 | consumed samples: 4591616 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918067E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:11.152134 | finish at 2025-09-10 11:45:46 + [2025-09-10 00:08:41] iteration 4485/ 11920 | consumed samples: 4592640 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906819E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:42.574863 | finish at 2025-09-10 11:46:23 + [2025-09-10 00:08:46] iteration 4486/ 11920 | consumed samples: 4593664 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.915621E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:48.256459 | finish at 2025-09-10 11:45:35 + [2025-09-10 00:08:52] iteration 4487/ 11920 | consumed samples: 4594688 | elapsed time per iteration (ms): 6020.8 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913279E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:25:52.463700 | finish at 2025-09-10 12:34:45 + [2025-09-10 00:08:58] iteration 4488/ 11920 | consumed samples: 4595712 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908433E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:06.371994 | finish at 2025-09-10 11:45:04 + [2025-09-10 00:09:04] iteration 4489/ 11920 | consumed samples: 4596736 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916950E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:26.556834 | finish at 2025-09-10 11:45:30 + [2025-09-10 00:09:09] iteration 4490/ 11920 | consumed samples: 4597760 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916758E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:46.469796 | finish at 2025-09-10 11:46:56 + [2025-09-10 
00:09:15] iteration 4491/ 11920 | consumed samples: 4598784 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894119E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:35.222451 | finish at 2025-09-10 11:44:50 + [2025-09-10 00:09:20] iteration 4492/ 11920 | consumed samples: 4599808 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920612E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:02.160945 | finish at 2025-09-10 11:46:23 + [2025-09-10 00:09:26] iteration 4493/ 11920 | consumed samples: 4600832 | elapsed time per iteration (ms): 5979.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937836E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:20:09.573379 | finish at 2025-09-10 12:29:36 + [2025-09-10 00:09:32] iteration 4494/ 11920 | consumed samples: 4601856 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927790E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:20.653106 | finish at 2025-09-10 11:45:53 + [2025-09-10 00:09:38] iteration 4495/ 11920 | consumed samples: 4602880 | elapsed time per iteration (ms): 5839.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924171E+00 | loss scale: 1.0 | grad norm: 
0.249 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:37.619745 | finish at 2025-09-10 12:12:15 + [2025-09-10 00:09:43] iteration 4496/ 11920 | consumed samples: 4603904 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917043E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:19.548096 | finish at 2025-09-10 11:46:03 + [2025-09-10 00:09:50] iteration 4497/ 11920 | consumed samples: 4604928 | elapsed time per iteration (ms): 6183.1 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913953E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:44:57.020226 | finish at 2025-09-10 12:54:47 + [2025-09-10 00:09:55] iteration 4498/ 11920 | consumed samples: 4605952 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924964E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:45.854681 | finish at 2025-09-10 11:48:41 + [2025-09-10 00:10:01] iteration 4499/ 11920 | consumed samples: 4606976 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916200E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:59.522912 | finish at 2025-09-10 11:46:00 + [2025-09-10 00:10:07] iteration 4500/ 11920 | consumed samples: 4608000 | elapsed time per iteration (ms): 
5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903194E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:23.447948 | finish at 2025-09-10 11:46:30 + [2025-09-10 00:10:12] iteration 4501/ 11920 | consumed samples: 4609024 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920773E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:37:01.338754 | finish at 2025-09-10 11:47:14 + [2025-09-10 00:10:18] iteration 4502/ 11920 | consumed samples: 4610048 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908083E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:39.718956 | finish at 2025-09-10 11:46:58 + [2025-09-10 00:10:23] iteration 4503/ 11920 | consumed samples: 4611072 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923504E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:58.204083 | finish at 2025-09-10 11:45:22 + [2025-09-10 00:10:29] iteration 4504/ 11920 | consumed samples: 4612096 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915158E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:34:58.984440 | finish at 2025-09-10 11:45:28 + [2025-09-10 00:10:35] iteration 4505/ 11920 | consumed samples: 4613120 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914865E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:39.874358 | finish at 2025-09-10 11:46:15 + [2025-09-10 00:10:40] iteration 4506/ 11920 | consumed samples: 4614144 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919085E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:32.578365 | finish at 2025-09-10 11:46:13 + [2025-09-10 00:10:46] iteration 4507/ 11920 | consumed samples: 4615168 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914135E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:14.028038 | finish at 2025-09-10 11:46:00 + [2025-09-10 00:10:52] iteration 4508/ 11920 | consumed samples: 4616192 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912599E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:52.577755 | finish at 2025-09-10 11:45:44 + [2025-09-10 00:10:57] iteration 4509/ 11920 | consumed samples: 4617216 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.909209E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:17.939917 | finish at 2025-09-10 11:45:15 + [2025-09-10 00:11:03] iteration 4510/ 11920 | consumed samples: 4618240 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917918E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:49.781306 | finish at 2025-09-10 11:45:53 + [2025-09-10 00:11:08] iteration 4511/ 11920 | consumed samples: 4619264 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904676E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:32.248338 | finish at 2025-09-10 11:46:41 + [2025-09-10 00:11:14] iteration 4512/ 11920 | consumed samples: 4620288 | elapsed time per iteration (ms): 5884.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923934E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:06:29.523777 | finish at 2025-09-10 12:17:44 + [2025-09-10 00:11:20] iteration 4513/ 11920 | consumed samples: 4621312 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916550E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:11.746574 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:11:26] iteration 
4514/ 11920 | consumed samples: 4622336 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911088E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:09.564683 | finish at 2025-09-10 11:45:35 + [2025-09-10 00:11:31] iteration 4515/ 11920 | consumed samples: 4623360 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906112E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:37.361641 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:11:37] iteration 4516/ 11920 | consumed samples: 4624384 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906044E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:38.717588 | finish at 2025-09-10 11:47:16 + [2025-09-10 00:11:42] iteration 4517/ 11920 | consumed samples: 4625408 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914090E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:00.874238 | finish at 2025-09-10 11:45:43 + [2025-09-10 00:11:48] iteration 4518/ 11920 | consumed samples: 4626432 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912069E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:32.744971 | finish at 2025-09-10 11:45:21 + [2025-09-10 00:11:54] iteration 4519/ 11920 | consumed samples: 4627456 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912803E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:11.762376 | finish at 2025-09-10 11:46:06 + [2025-09-10 00:11:59] iteration 4520/ 11920 | consumed samples: 4628480 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908000E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:15.133972 | finish at 2025-09-10 11:45:14 + [2025-09-10 00:12:05] iteration 4521/ 11920 | consumed samples: 4629504 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912093E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:07.167751 | finish at 2025-09-10 11:46:12 + [2025-09-10 00:12:11] iteration 4522/ 11920 | consumed samples: 4630528 | elapsed time per iteration (ms): 5981.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930814E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:34.462903 | finish at 2025-09-10 12:29:45 + [2025-09-10 00:12:17] iteration 4523/ 11920 | consumed samples: 4631552 | elapsed time per iteration (ms): 5630.0 | throughput per 
GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919133E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:04.885108 | finish at 2025-09-10 11:46:21 + [2025-09-10 00:12:22] iteration 4524/ 11920 | consumed samples: 4632576 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908110E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:43.898177 | finish at 2025-09-10 11:46:06 + [2025-09-10 00:12:28] iteration 4525/ 11920 | consumed samples: 4633600 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916303E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:12.042328 | finish at 2025-09-10 11:45:40 + [2025-09-10 00:12:33] iteration 4526/ 11920 | consumed samples: 4634624 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907819E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:00.698419 | finish at 2025-09-10 11:46:34 + [2025-09-10 00:12:39] iteration 4527/ 11920 | consumed samples: 4635648 | elapsed time per iteration (ms): 5643.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898096E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:18.391471 
| finish at 2025-09-10 11:47:58 + [2025-09-10 00:12:45] iteration 4528/ 11920 | consumed samples: 4636672 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926104E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:44.361122 | finish at 2025-09-10 11:46:29 + [2025-09-10 00:12:50] iteration 4529/ 11920 | consumed samples: 4637696 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920802E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:31.313720 | finish at 2025-09-10 11:45:22 + [2025-09-10 00:12:56] iteration 4530/ 11920 | consumed samples: 4638720 | elapsed time per iteration (ms): 5851.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917726E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:00:43.453877 | finish at 2025-09-10 12:13:40 + [2025-09-10 00:13:02] iteration 4531/ 11920 | consumed samples: 4639744 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909713E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:29.702799 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:13:07] iteration 4532/ 11920 | consumed samples: 4640768 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.920036E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:00.770337 | finish at 2025-09-10 11:46:08 + [2025-09-10 00:13:13] iteration 4533/ 11920 | consumed samples: 4641792 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912236E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:13.314229 | finish at 2025-09-10 11:46:26 + [2025-09-10 00:13:19] iteration 4534/ 11920 | consumed samples: 4642816 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918566E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:26.350375 | finish at 2025-09-10 11:45:45 + [2025-09-10 00:13:24] iteration 4535/ 11920 | consumed samples: 4643840 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923267E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:17.969832 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:13:30] iteration 4536/ 11920 | consumed samples: 4644864 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913507E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:08.546844 | finish at 2025-09-10 11:46:39 + [2025-09-10 00:13:36] iteration 4537/ 11920 | consumed samples: 
4645888 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921669E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:40.619338 | finish at 2025-09-10 11:46:16 + [2025-09-10 00:13:41] iteration 4538/ 11920 | consumed samples: 4646912 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924045E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:41.583451 | finish at 2025-09-10 11:47:23 + [2025-09-10 00:13:47] iteration 4539/ 11920 | consumed samples: 4647936 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917447E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:39.940596 | finish at 2025-09-10 11:46:27 + [2025-09-10 00:13:53] iteration 4540/ 11920 | consumed samples: 4648960 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919796E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:12.201505 | finish at 2025-09-10 11:45:05 + [2025-09-10 00:13:58] iteration 4541/ 11920 | consumed samples: 4649984 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903271E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:24.625251 | finish at 2025-09-10 11:45:23 + [2025-09-10 00:14:04] iteration 4542/ 11920 | consumed samples: 4651008 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913394E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:00.286464 | finish at 2025-09-10 11:46:04 + [2025-09-10 00:14:09] iteration 4543/ 11920 | consumed samples: 4652032 | elapsed time per iteration (ms): 5641.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913947E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:34.821574 | finish at 2025-09-10 11:47:44 + [2025-09-10 00:14:15] iteration 4544/ 11920 | consumed samples: 4653056 | elapsed time per iteration (ms): 5875.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924600E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:15.067455 | finish at 2025-09-10 12:16:30 + [2025-09-10 00:14:21] iteration 4545/ 11920 | consumed samples: 4654080 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914073E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:23.288348 | finish at 2025-09-10 11:45:44 + [2025-09-10 00:14:27] iteration 4546/ 11920 | consumed samples: 4655104 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922116E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:05.815669 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:14:32] iteration 4547/ 11920 | consumed samples: 4656128 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911765E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:22.936659 | finish at 2025-09-10 11:46:55 + [2025-09-10 00:14:38] iteration 4548/ 11920 | consumed samples: 4657152 | elapsed time per iteration (ms): 5643.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892177E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:27.179959 | finish at 2025-09-10 11:48:05 + [2025-09-10 00:14:43] iteration 4549/ 11920 | consumed samples: 4658176 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909853E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:48.734452 | finish at 2025-09-10 11:47:32 + [2025-09-10 00:14:49] iteration 4550/ 11920 | consumed samples: 4659200 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908485E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:08.627858 | finish at 2025-09-10 
11:44:58 + [2025-09-10 00:14:55] iteration 4551/ 11920 | consumed samples: 4660224 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927476E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:41.093782 | finish at 2025-09-10 11:45:36 + [2025-09-10 00:15:00] iteration 4552/ 11920 | consumed samples: 4661248 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926477E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:08.340094 | finish at 2025-09-10 11:45:09 + [2025-09-10 00:15:06] iteration 4553/ 11920 | consumed samples: 4662272 | elapsed time per iteration (ms): 5841.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904123E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:10.568551 | finish at 2025-09-10 12:12:17 + [2025-09-10 00:15:12] iteration 4554/ 11920 | consumed samples: 4663296 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920092E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:50.983505 | finish at 2025-09-10 11:46:03 + [2025-09-10 00:15:17] iteration 4555/ 11920 | consumed samples: 4664320 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919676E+00 | loss 
scale: 1.0 | grad norm: 0.213 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:42.937310 | finish at 2025-09-10 11:45:00 + [2025-09-10 00:15:23] iteration 4556/ 11920 | consumed samples: 4665344 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906343E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:16.398892 | finish at 2025-09-10 11:45:39 + [2025-09-10 00:15:29] iteration 4557/ 11920 | consumed samples: 4666368 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916509E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:57.508589 | finish at 2025-09-10 11:45:26 + [2025-09-10 00:15:34] iteration 4558/ 11920 | consumed samples: 4667392 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905698E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:05.736799 | finish at 2025-09-10 11:45:40 + [2025-09-10 00:15:40] iteration 4559/ 11920 | consumed samples: 4668416 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917354E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:31.546426 | finish at 2025-09-10 11:45:11 + [2025-09-10 00:15:46] iteration 4560/ 11920 | consumed samples: 4669440 | elapsed time 
per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924335E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:49.596558 | finish at 2025-09-10 11:46:35 + [2025-09-10 00:15:51] iteration 4561/ 11920 | consumed samples: 4670464 | elapsed time per iteration (ms): 5811.6 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917234E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:47.827357 | finish at 2025-09-10 12:08:39 + [2025-09-10 00:15:57] iteration 4562/ 11920 | consumed samples: 4671488 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903615E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:45.290552 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:16:03] iteration 4563/ 11920 | consumed samples: 4672512 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927945E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:30.181898 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:16:08] iteration 4564/ 11920 | consumed samples: 4673536 | elapsed time per iteration (ms): 5841.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918116E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:56:10.695170 | finish at 2025-09-10 12:12:19 + [2025-09-10 00:16:14] iteration 4565/ 11920 | consumed samples: 4674560 | elapsed time per iteration (ms): 5819.1 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917408E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:19.148363 | finish at 2025-09-10 12:09:33 + [2025-09-10 00:16:20] iteration 4566/ 11920 | consumed samples: 4675584 | elapsed time per iteration (ms): 5837.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917488E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:55:31.861683 | finish at 2025-09-10 12:11:52 + [2025-09-10 00:16:26] iteration 4567/ 11920 | consumed samples: 4676608 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920940E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:58.983082 | finish at 2025-09-10 11:45:25 + [2025-09-10 00:16:31] iteration 4568/ 11920 | consumed samples: 4677632 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910380E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:31.117487 | finish at 2025-09-10 11:46:02 + [2025-09-10 00:16:37] iteration 4569/ 11920 | consumed samples: 4678656 | elapsed time per iteration (ms): 5853.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.926293E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:09.801517 | finish at 2025-09-10 12:13:47 + [2025-09-10 00:16:43] iteration 4570/ 11920 | consumed samples: 4679680 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918453E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:56.077716 | finish at 2025-09-10 11:46:39 + [2025-09-10 00:16:48] iteration 4571/ 11920 | consumed samples: 4680704 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911210E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:31.478483 | finish at 2025-09-10 11:45:20 + [2025-09-10 00:16:54] iteration 4572/ 11920 | consumed samples: 4681728 | elapsed time per iteration (ms): 5857.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895351E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:57:19.530064 | finish at 2025-09-10 12:14:14 + [2025-09-10 00:17:00] iteration 4573/ 11920 | consumed samples: 4682752 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922725E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:29.295327 | finish at 2025-09-10 11:45:29 + [2025-09-10 
00:17:06] iteration 4574/ 11920 | consumed samples: 4683776 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906850E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:55.273643 | finish at 2025-09-10 11:46:01 + [2025-09-10 00:17:11] iteration 4575/ 11920 | consumed samples: 4684800 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913156E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:07.598959 | finish at 2025-09-10 11:47:19 + [2025-09-10 00:17:17] iteration 4576/ 11920 | consumed samples: 4685824 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900328E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:39.917690 | finish at 2025-09-10 11:44:57 + [2025-09-10 00:17:22] iteration 4577/ 11920 | consumed samples: 4686848 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907385E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:53.305192 | finish at 2025-09-10 11:45:16 + [2025-09-10 00:17:28] iteration 4578/ 11920 | consumed samples: 4687872 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914637E+00 | loss scale: 1.0 | grad norm: 
0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:23.608469 | finish at 2025-09-10 11:44:52 + [2025-09-10 00:17:34] iteration 4579/ 11920 | consumed samples: 4688896 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897813E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:20.955533 | finish at 2025-09-10 11:45:55 + [2025-09-10 00:17:39] iteration 4580/ 11920 | consumed samples: 4689920 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909788E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:30.242662 | finish at 2025-09-10 11:45:10 + [2025-09-10 00:17:45] iteration 4581/ 11920 | consumed samples: 4690944 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908644E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:43.800042 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:17:51] iteration 4582/ 11920 | consumed samples: 4691968 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898170E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:50.331024 | finish at 2025-09-10 11:46:41 + [2025-09-10 00:17:56] iteration 4583/ 11920 | consumed samples: 4692992 | elapsed time per iteration (ms): 
5829.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923687E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:53.773577 | finish at 2025-09-10 12:10:50 + [2025-09-10 00:18:02] iteration 4584/ 11920 | consumed samples: 4694016 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911935E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:21.483198 | finish at 2025-09-10 11:46:23 + [2025-09-10 00:18:08] iteration 4585/ 11920 | consumed samples: 4695040 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892716E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:17.547401 | finish at 2025-09-10 11:47:25 + [2025-09-10 00:18:13] iteration 4586/ 11920 | consumed samples: 4696064 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903954E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:14.458269 | finish at 2025-09-10 11:46:28 + [2025-09-10 00:18:19] iteration 4587/ 11920 | consumed samples: 4697088 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921148E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:27:57.795797 | finish at 2025-09-10 11:46:17 + [2025-09-10 00:18:25] iteration 4588/ 11920 | consumed samples: 4698112 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925749E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:58.902600 | finish at 2025-09-10 11:45:23 + [2025-09-10 00:18:30] iteration 4589/ 11920 | consumed samples: 4699136 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918190E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:08.282586 | finish at 2025-09-10 11:45:38 + [2025-09-10 00:18:36] iteration 4590/ 11920 | consumed samples: 4700160 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927824E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:06.596751 | finish at 2025-09-10 11:46:42 + [2025-09-10 00:18:41] iteration 4591/ 11920 | consumed samples: 4701184 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928107E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:03.504876 | finish at 2025-09-10 11:46:45 + [2025-09-10 00:18:47] iteration 4592/ 11920 | consumed samples: 4702208 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.911740E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:08.178307 | finish at 2025-09-10 11:46:55 + [2025-09-10 00:18:53] iteration 4593/ 11920 | consumed samples: 4703232 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914676E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:08.010031 | finish at 2025-09-10 11:47:01 + [2025-09-10 00:18:58] iteration 4594/ 11920 | consumed samples: 4704256 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926188E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:28.074851 | finish at 2025-09-10 11:45:26 + [2025-09-10 00:19:04] iteration 4595/ 11920 | consumed samples: 4705280 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912492E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:05.074086 | finish at 2025-09-10 11:45:09 + [2025-09-10 00:19:10] iteration 4596/ 11920 | consumed samples: 4706304 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910944E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:50.057175 | finish at 2025-09-10 11:47:00 + [2025-09-10 00:19:15] iteration 
4597/ 11920 | consumed samples: 4707328 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902353E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:06.843620 | finish at 2025-09-10 11:47:22 + [2025-09-10 00:19:21] iteration 4598/ 11920 | consumed samples: 4708352 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904613E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:44.010760 | finish at 2025-09-10 11:46:05 + [2025-09-10 00:19:26] iteration 4599/ 11920 | consumed samples: 4709376 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914430E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:32.714073 | finish at 2025-09-10 11:45:59 + [2025-09-10 00:19:32] iteration 4600/ 11920 | consumed samples: 4710400 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910075E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:26.904173 | finish at 2025-09-10 11:45:59 + [2025-09-10 00:19:38] iteration 4601/ 11920 | consumed samples: 4711424 | elapsed time per iteration (ms): 5853.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919988E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 
5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:59.996275 | finish at 2025-09-10 12:13:38 + [2025-09-10 00:19:45] iteration 4602/ 11920 | consumed samples: 4712448 | elapsed time per iteration (ms): 6584.0 | throughput per GPU (TFLOP/s/GPU): 68.6 | MFU 6.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899195E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 13:23:01.988480 | finish at 2025-09-10 13:42:47 + [2025-09-10 00:19:50] iteration 4603/ 11920 | consumed samples: 4713472 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917463E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:22.330059 | finish at 2025-09-10 11:46:12 + [2025-09-10 00:19:56] iteration 4604/ 11920 | consumed samples: 4714496 | elapsed time per iteration (ms): 6017.4 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912107E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:13:43.169517 | finish at 2025-09-10 12:33:39 + [2025-09-10 00:20:02] iteration 4605/ 11920 | consumed samples: 4715520 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923487E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:25:50.800816 | finish at 2025-09-10 11:45:53 + [2025-09-10 00:20:08] iteration 4606/ 11920 | consumed samples: 4716544 | elapsed time per iteration (ms): 5853.6 | throughput per 
GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911314E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:33.183502 | finish at 2025-09-10 12:13:41 + [2025-09-10 00:20:13] iteration 4607/ 11920 | consumed samples: 4717568 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923314E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:27:21.922578 | finish at 2025-09-10 11:47:35 + [2025-09-10 00:20:19] iteration 4608/ 11920 | consumed samples: 4718592 | elapsed time per iteration (ms): 5881.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896742E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:43.608650 | finish at 2025-09-10 12:17:03 + [2025-09-10 00:20:25] iteration 4609/ 11920 | consumed samples: 4719616 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916482E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:25:43.805093 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:20:31] iteration 4610/ 11920 | consumed samples: 4720640 | elapsed time per iteration (ms): 6203.2 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930466E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:35:45.275974 
| finish at 2025-09-10 12:56:16 + [2025-09-10 00:20:37] iteration 4611/ 11920 | consumed samples: 4721664 | elapsed time per iteration (ms): 5865.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916321E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:54:28.257441 | finish at 2025-09-10 12:15:05 + [2025-09-10 00:20:42] iteration 4612/ 11920 | consumed samples: 4722688 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913374E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:25:36.029451 | finish at 2025-09-10 11:46:19 + [2025-09-10 00:20:48] iteration 4613/ 11920 | consumed samples: 4723712 | elapsed time per iteration (ms): 5883.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909569E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:29.635944 | finish at 2025-09-10 12:17:18 + [2025-09-10 00:20:54] iteration 4614/ 11920 | consumed samples: 4724736 | elapsed time per iteration (ms): 5863.3 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910205E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 13.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:56.989978 | finish at 2025-09-10 12:14:51 + [2025-09-10 00:21:00] iteration 4615/ 11920 | consumed samples: 4725760 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 2.894294E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:58.741078 | finish at 2025-09-10 11:45:59 + [2025-09-10 00:21:05] iteration 4616/ 11920 | consumed samples: 4726784 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927541E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:32.212831 | finish at 2025-09-10 11:45:38 + [2025-09-10 00:21:11] iteration 4617/ 11920 | consumed samples: 4727808 | elapsed time per iteration (ms): 5999.0 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889125E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:10.577389 | finish at 2025-09-10 12:31:22 + [2025-09-10 00:21:17] iteration 4618/ 11920 | consumed samples: 4728832 | elapsed time per iteration (ms): 5831.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902917E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:49:38.587649 | finish at 2025-09-10 12:10:56 + [2025-09-10 00:21:23] iteration 4619/ 11920 | consumed samples: 4729856 | elapsed time per iteration (ms): 5887.6 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911689E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:25.536046 | finish at 2025-09-10 12:17:49 + [2025-09-10 00:21:29] iteration 4620/ 11920 | consumed 
samples: 4730880 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907183E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:26.214132 | finish at 2025-09-10 11:45:55 + [2025-09-10 00:21:34] iteration 4621/ 11920 | consumed samples: 4731904 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914798E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:49.256013 | finish at 2025-09-10 11:45:24 + [2025-09-10 00:21:40] iteration 4622/ 11920 | consumed samples: 4732928 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908765E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:30.318428 | finish at 2025-09-10 11:46:10 + [2025-09-10 00:21:46] iteration 4623/ 11920 | consumed samples: 4733952 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904234E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:23.490394 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:21:51] iteration 4624/ 11920 | consumed samples: 4734976 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911243E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:43.160522 | finish at 2025-09-10 11:46:35 + [2025-09-10 00:21:57] iteration 4625/ 11920 | consumed samples: 4736000 | elapsed time per iteration (ms): 5869.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908182E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:38.826340 | finish at 2025-09-10 12:15:36 + [2025-09-10 00:22:03] iteration 4626/ 11920 | consumed samples: 4737024 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904432E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:18.628191 | finish at 2025-09-10 11:46:21 + [2025-09-10 00:22:08] iteration 4627/ 11920 | consumed samples: 4738048 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912320E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:48.795183 | finish at 2025-09-10 11:45:57 + [2025-09-10 00:22:14] iteration 4628/ 11920 | consumed samples: 4739072 | elapsed time per iteration (ms): 5938.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906621E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:45.742474 | finish at 2025-09-10 12:24:00 + [2025-09-10 00:22:20] iteration 4629/ 11920 | consumed samples: 4740096 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911420E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:05.620809 | finish at 2025-09-10 11:46:26 + [2025-09-10 00:22:26] iteration 4630/ 11920 | consumed samples: 4741120 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893826E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:53.337843 | finish at 2025-09-10 11:46:19 + [2025-09-10 00:22:32] iteration 4631/ 11920 | consumed samples: 4742144 | elapsed time per iteration (ms): 5997.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910229E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:38.105779 | finish at 2025-09-10 12:31:10 + [2025-09-10 00:22:37] iteration 4632/ 11920 | consumed samples: 4743168 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900426E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:33.263857 | finish at 2025-09-10 11:46:11 + [2025-09-10 00:22:43] iteration 4633/ 11920 | consumed samples: 4744192 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895382E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:36.773107 | finish at 2025-09-10 
11:46:20 + [2025-09-10 00:22:49] iteration 4634/ 11920 | consumed samples: 4745216 | elapsed time per iteration (ms): 5951.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908609E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:45.409681 | finish at 2025-09-10 12:25:34 + [2025-09-10 00:22:55] iteration 4635/ 11920 | consumed samples: 4746240 | elapsed time per iteration (ms): 5992.7 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910061E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:07:36.985232 | finish at 2025-09-10 12:30:32 + [2025-09-10 00:23:00] iteration 4636/ 11920 | consumed samples: 4747264 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904561E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:39.410937 | finish at 2025-09-10 11:45:40 + [2025-09-10 00:23:06] iteration 4637/ 11920 | consumed samples: 4748288 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917919E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:31.295997 | finish at 2025-09-10 11:45:37 + [2025-09-10 00:23:12] iteration 4638/ 11920 | consumed samples: 4749312 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890255E+00 | loss 
scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:51.302295 | finish at 2025-09-10 11:45:03 + [2025-09-10 00:23:17] iteration 4639/ 11920 | consumed samples: 4750336 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908116E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:15.903148 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:23:23] iteration 4640/ 11920 | consumed samples: 4751360 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898509E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:06.131802 | finish at 2025-09-10 11:46:29 + [2025-09-10 00:23:29] iteration 4641/ 11920 | consumed samples: 4752384 | elapsed time per iteration (ms): 5824.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907136E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:35.822841 | finish at 2025-09-10 12:10:05 + [2025-09-10 00:23:34] iteration 4642/ 11920 | consumed samples: 4753408 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905921E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:21.637390 | finish at 2025-09-10 11:45:56 + [2025-09-10 00:23:40] iteration 4643/ 11920 | consumed samples: 4754432 | elapsed time 
per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907698E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:51.356843 | finish at 2025-09-10 11:46:31 + [2025-09-10 00:23:46] iteration 4644/ 11920 | consumed samples: 4755456 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909305E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:57.677169 | finish at 2025-09-10 11:46:43 + [2025-09-10 00:23:51] iteration 4645/ 11920 | consumed samples: 4756480 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905896E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:09.903978 | finish at 2025-09-10 11:46:01 + [2025-09-10 00:23:57] iteration 4646/ 11920 | consumed samples: 4757504 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908070E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:12.508657 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:24:03] iteration 4647/ 11920 | consumed samples: 4758528 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912754E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:21:31.146772 | finish at 2025-09-10 11:45:34 + [2025-09-10 00:24:08] iteration 4648/ 11920 | consumed samples: 4759552 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917973E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:42.768625 | finish at 2025-09-10 11:45:51 + [2025-09-10 00:24:14] iteration 4649/ 11920 | consumed samples: 4760576 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914331E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:08.973883 | finish at 2025-09-10 11:45:23 + [2025-09-10 00:24:20] iteration 4650/ 11920 | consumed samples: 4761600 | elapsed time per iteration (ms): 5934.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905204E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:59:01.395156 | finish at 2025-09-10 12:23:21 + [2025-09-10 00:24:25] iteration 4651/ 11920 | consumed samples: 4762624 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928372E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:40.709214 | finish at 2025-09-10 11:47:06 + [2025-09-10 00:24:31] iteration 4652/ 11920 | consumed samples: 4763648 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.915364E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:45.674818 | finish at 2025-09-10 11:46:17 + [2025-09-10 00:24:37] iteration 4653/ 11920 | consumed samples: 4764672 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917397E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:22:23.550177 | finish at 2025-09-10 11:47:00 + [2025-09-10 00:24:42] iteration 4654/ 11920 | consumed samples: 4765696 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937614E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:35.981020 | finish at 2025-09-10 11:46:18 + [2025-09-10 00:24:48] iteration 4655/ 11920 | consumed samples: 4766720 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919299E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:09.016473 | finish at 2025-09-10 11:45:57 + [2025-09-10 00:24:54] iteration 4656/ 11920 | consumed samples: 4767744 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922879E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:39.215797 | finish at 2025-09-10 11:45:33 + [2025-09-10 
00:24:59] iteration 4657/ 11920 | consumed samples: 4768768 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911917E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:06.474536 | finish at 2025-09-10 11:45:06 + [2025-09-10 00:25:05] iteration 4658/ 11920 | consumed samples: 4769792 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922127E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:10.789149 | finish at 2025-09-10 11:45:16 + [2025-09-10 00:25:10] iteration 4659/ 11920 | consumed samples: 4770816 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915536E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:26.859046 | finish at 2025-09-10 11:45:37 + [2025-09-10 00:25:16] iteration 4660/ 11920 | consumed samples: 4771840 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896025E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:08.557305 | finish at 2025-09-10 11:45:25 + [2025-09-10 00:25:22] iteration 4661/ 11920 | consumed samples: 4772864 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915715E+00 | loss scale: 1.0 | grad norm: 
0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:33.499339 | finish at 2025-09-10 11:46:55 + [2025-09-10 00:25:27] iteration 4662/ 11920 | consumed samples: 4773888 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909991E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:07.538347 | finish at 2025-09-10 11:46:35 + [2025-09-10 00:25:33] iteration 4663/ 11920 | consumed samples: 4774912 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924069E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:31.660239 | finish at 2025-09-10 11:46:05 + [2025-09-10 00:25:39] iteration 4664/ 11920 | consumed samples: 4775936 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913841E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:34.387720 | finish at 2025-09-10 11:46:13 + [2025-09-10 00:25:44] iteration 4665/ 11920 | consumed samples: 4776960 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908104E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:08.585278 | finish at 2025-09-10 11:46:53 + [2025-09-10 00:25:50] iteration 4666/ 11920 | consumed samples: 4777984 | elapsed time per iteration (ms): 
5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910624E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:26.378626 | finish at 2025-09-10 11:46:16 + [2025-09-10 00:25:55] iteration 4667/ 11920 | consumed samples: 4779008 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910751E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:37.192429 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:26:01] iteration 4668/ 11920 | consumed samples: 4780032 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916001E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:54.721780 | finish at 2025-09-10 11:45:56 + [2025-09-10 00:26:07] iteration 4669/ 11920 | consumed samples: 4781056 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895210E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:52.578751 | finish at 2025-09-10 11:46:59 + [2025-09-10 00:26:12] iteration 4670/ 11920 | consumed samples: 4782080 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905003E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:19:24.313817 | finish at 2025-09-10 11:45:37 + [2025-09-10 00:26:18] iteration 4671/ 11920 | consumed samples: 4783104 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917060E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:25.609523 | finish at 2025-09-10 11:45:44 + [2025-09-10 00:26:24] iteration 4672/ 11920 | consumed samples: 4784128 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910037E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:55.710045 | finish at 2025-09-10 11:46:19 + [2025-09-10 00:26:29] iteration 4673/ 11920 | consumed samples: 4785152 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917178E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:58.245187 | finish at 2025-09-10 11:45:27 + [2025-09-10 00:26:35] iteration 4674/ 11920 | consumed samples: 4786176 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899770E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:07.265041 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:26:41] iteration 4675/ 11920 | consumed samples: 4787200 | elapsed time per iteration (ms): 5853.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.914648E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:46:48.900347 | finish at 2025-09-10 12:13:30 + [2025-09-10 00:26:46] iteration 4676/ 11920 | consumed samples: 4788224 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919850E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:21.394553 | finish at 2025-09-10 11:46:08 + [2025-09-10 00:26:52] iteration 4677/ 11920 | consumed samples: 4789248 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917998E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:16.591350 | finish at 2025-09-10 11:46:08 + [2025-09-10 00:26:58] iteration 4678/ 11920 | consumed samples: 4790272 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909774E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:30.602978 | finish at 2025-09-10 11:46:28 + [2025-09-10 00:27:03] iteration 4679/ 11920 | consumed samples: 4791296 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915366E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:03.347839 | finish at 2025-09-10 11:45:06 + [2025-09-10 00:27:09] iteration 
4680/ 11920 | consumed samples: 4792320 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922384E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:30.286293 | finish at 2025-09-10 11:45:39 + [2025-09-10 00:27:14] iteration 4681/ 11920 | consumed samples: 4793344 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918214E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:19.832497 | finish at 2025-09-10 11:45:34 + [2025-09-10 00:27:20] iteration 4682/ 11920 | consumed samples: 4794368 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930243E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:19:45.964266 | finish at 2025-09-10 11:47:06 + [2025-09-10 00:27:26] iteration 4683/ 11920 | consumed samples: 4795392 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920759E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:38.505219 | finish at 2025-09-10 11:46:04 + [2025-09-10 00:27:31] iteration 4684/ 11920 | consumed samples: 4796416 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919714E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:12.693981 | finish at 2025-09-10 11:45:44 + [2025-09-10 00:27:37] iteration 4685/ 11920 | consumed samples: 4797440 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924310E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:31.387075 | finish at 2025-09-10 11:46:08 + [2025-09-10 00:27:43] iteration 4686/ 11920 | consumed samples: 4798464 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910172E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:12.229641 | finish at 2025-09-10 11:45:55 + [2025-09-10 00:27:48] iteration 4687/ 11920 | consumed samples: 4799488 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906493E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:18:09.355054 | finish at 2025-09-10 11:45:58 + [2025-09-10 00:27:54] iteration 4688/ 11920 | consumed samples: 4800512 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913307E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:24.627151 | finish at 2025-09-10 11:45:18 + [2025-09-10 00:27:59] iteration 4689/ 11920 | consumed samples: 4801536 | elapsed time per iteration (ms): 5619.5 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902567E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:14.602211 | finish at 2025-09-10 11:45:14 + [2025-09-10 00:28:05] iteration 4690/ 11920 | consumed samples: 4802560 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901970E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:13.726516 | finish at 2025-09-10 11:45:19 + [2025-09-10 00:28:11] iteration 4691/ 11920 | consumed samples: 4803584 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908789E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:31.679051 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:28:16] iteration 4692/ 11920 | consumed samples: 4804608 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905423E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:54.331368 | finish at 2025-09-10 11:46:11 + [2025-09-10 00:28:22] iteration 4693/ 11920 | consumed samples: 4805632 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909002E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:06.592672 
| finish at 2025-09-10 11:45:28 + [2025-09-10 00:28:28] iteration 4694/ 11920 | consumed samples: 4806656 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918787E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:01.436329 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:28:33] iteration 4695/ 11920 | consumed samples: 4807680 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903808E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:42.153733 | finish at 2025-09-10 11:46:15 + [2025-09-10 00:28:39] iteration 4696/ 11920 | consumed samples: 4808704 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914629E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:43.796442 | finish at 2025-09-10 11:45:23 + [2025-09-10 00:28:44] iteration 4697/ 11920 | consumed samples: 4809728 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897571E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:13.475314 | finish at 2025-09-10 11:45:58 + [2025-09-10 00:28:50] iteration 4698/ 11920 | consumed samples: 4810752 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.917233E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:27.286192 | finish at 2025-09-10 11:45:17 + [2025-09-10 00:28:56] iteration 4699/ 11920 | consumed samples: 4811776 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900717E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:22.516722 | finish at 2025-09-10 11:45:18 + [2025-09-10 00:29:01] iteration 4700/ 11920 | consumed samples: 4812800 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900083E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:53.649883 | finish at 2025-09-10 11:45:55 + [2025-09-10 00:29:07] iteration 4701/ 11920 | consumed samples: 4813824 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898690E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:59.189783 | finish at 2025-09-10 11:46:06 + [2025-09-10 00:29:13] iteration 4702/ 11920 | consumed samples: 4814848 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906369E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:23.783895 | finish at 2025-09-10 11:46:36 + [2025-09-10 00:29:18] iteration 4703/ 11920 | consumed samples: 
4815872 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909344E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:35.384107 | finish at 2025-09-10 11:45:54 + [2025-09-10 00:29:24] iteration 4704/ 11920 | consumed samples: 4816896 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907547E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:17.609417 | finish at 2025-09-10 11:46:41 + [2025-09-10 00:29:29] iteration 4705/ 11920 | consumed samples: 4817920 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895069E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:03.524576 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:29:35] iteration 4706/ 11920 | consumed samples: 4818944 | elapsed time per iteration (ms): 5829.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895143E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:56.042254 | finish at 2025-09-10 12:10:31 + [2025-09-10 00:29:41] iteration 4707/ 11920 | consumed samples: 4819968 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896750E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 5.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:33.641973 | finish at 2025-09-10 11:47:14 + [2025-09-10 00:29:46] iteration 4708/ 11920 | consumed samples: 4820992 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918674E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:07.091838 | finish at 2025-09-10 11:46:54 + [2025-09-10 00:29:52] iteration 4709/ 11920 | consumed samples: 4822016 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918757E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:31.513286 | finish at 2025-09-10 11:45:24 + [2025-09-10 00:29:58] iteration 4710/ 11920 | consumed samples: 4823040 | elapsed time per iteration (ms): 5930.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919890E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:52:40.206501 | finish at 2025-09-10 12:22:38 + [2025-09-10 00:30:04] iteration 4711/ 11920 | consumed samples: 4824064 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896718E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:29.501445 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:30:09] iteration 4712/ 11920 | consumed samples: 4825088 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920796E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:17.847813 | finish at 2025-09-10 11:46:27 + [2025-09-10 00:30:15] iteration 4713/ 11920 | consumed samples: 4826112 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907654E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:41.897439 | finish at 2025-09-10 11:45:57 + [2025-09-10 00:30:21] iteration 4714/ 11920 | consumed samples: 4827136 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918029E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:31.045798 | finish at 2025-09-10 11:45:52 + [2025-09-10 00:30:26] iteration 4715/ 11920 | consumed samples: 4828160 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911233E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:55.457009 | finish at 2025-09-10 11:46:22 + [2025-09-10 00:30:32] iteration 4716/ 11920 | consumed samples: 4829184 | elapsed time per iteration (ms): 5837.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915071E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:40:50.013453 | finish at 2025-09-10 
12:11:22 + [2025-09-10 00:30:38] iteration 4717/ 11920 | consumed samples: 4830208 | elapsed time per iteration (ms): 6251.8 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909843E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:30:31.896154 | finish at 2025-09-10 13:01:10 + [2025-09-10 00:30:44] iteration 4718/ 11920 | consumed samples: 4831232 | elapsed time per iteration (ms): 6088.2 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903609E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:10:47.390491 | finish at 2025-09-10 12:41:32 + [2025-09-10 00:30:50] iteration 4719/ 11920 | consumed samples: 4832256 | elapsed time per iteration (ms): 6081.2 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900503E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:09:50.485159 | finish at 2025-09-10 12:40:41 + [2025-09-10 00:30:56] iteration 4720/ 11920 | consumed samples: 4833280 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901975E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:55.122299 | finish at 2025-09-10 11:44:51 + [2025-09-10 00:31:02] iteration 4721/ 11920 | consumed samples: 4834304 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895123E+00 | loss 
scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:14:52.440645 | finish at 2025-09-10 11:45:54 + [2025-09-10 00:31:07] iteration 4722/ 11920 | consumed samples: 4835328 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912539E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:14:58.954150 | finish at 2025-09-10 11:46:06 + [2025-09-10 00:31:13] iteration 4723/ 11920 | consumed samples: 4836352 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902977E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:14:41.939314 | finish at 2025-09-10 11:45:55 + [2025-09-10 00:31:19] iteration 4724/ 11920 | consumed samples: 4837376 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916135E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:14:46.037125 | finish at 2025-09-10 11:46:05 + [2025-09-10 00:31:24] iteration 4725/ 11920 | consumed samples: 4838400 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931619E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:14:07.924283 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:31:30] iteration 4726/ 11920 | consumed samples: 4839424 | elapsed time 
per iteration (ms): 5838.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918612E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:58.634172 | finish at 2025-09-10 12:11:29 + [2025-09-10 00:31:36] iteration 4727/ 11920 | consumed samples: 4840448 | elapsed time per iteration (ms): 6253.6 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914816E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:29:42.126803 | finish at 2025-09-10 13:01:18 + [2025-09-10 00:31:42] iteration 4728/ 11920 | consumed samples: 4841472 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905194E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:14:11.687187 | finish at 2025-09-10 11:45:54 + [2025-09-10 00:31:48] iteration 4729/ 11920 | consumed samples: 4842496 | elapsed time per iteration (ms): 5880.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923992E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:45.522225 | finish at 2025-09-10 12:16:33 + [2025-09-10 00:31:54] iteration 4730/ 11920 | consumed samples: 4843520 | elapsed time per iteration (ms): 5948.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907544E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:52:51.478353 | finish at 2025-09-10 12:24:45 + [2025-09-10 00:32:00] iteration 4731/ 11920 | consumed samples: 4844544 | elapsed time per iteration (ms): 5969.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923321E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:55:12.260963 | finish at 2025-09-10 12:27:12 + [2025-09-10 00:32:05] iteration 4732/ 11920 | consumed samples: 4845568 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920979E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:49.516356 | finish at 2025-09-10 11:45:55 + [2025-09-10 00:32:11] iteration 4733/ 11920 | consumed samples: 4846592 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913083E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:22.687867 | finish at 2025-09-10 11:47:34 + [2025-09-10 00:32:17] iteration 4734/ 11920 | consumed samples: 4847616 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906149E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:23.529587 | finish at 2025-09-10 11:45:40 + [2025-09-10 00:32:23] iteration 4735/ 11920 | consumed samples: 4848640 | elapsed time per iteration (ms): 6120.8 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.46% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.908836E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:12:57.735715 | finish at 2025-09-10 12:45:20 + [2025-09-10 00:32:28] iteration 4736/ 11920 | consumed samples: 4849664 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913920E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:15.006157 | finish at 2025-09-10 11:45:43 + [2025-09-10 00:32:34] iteration 4737/ 11920 | consumed samples: 4850688 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909117E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:32.552915 | finish at 2025-09-10 11:45:06 + [2025-09-10 00:32:40] iteration 4738/ 11920 | consumed samples: 4851712 | elapsed time per iteration (ms): 5641.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904534E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:17.117699 | finish at 2025-09-10 11:47:57 + [2025-09-10 00:32:45] iteration 4739/ 11920 | consumed samples: 4852736 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886320E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:14.646997 | finish at 2025-09-10 11:48:00 + [2025-09-10 
00:32:51] iteration 4740/ 11920 | consumed samples: 4853760 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904659E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:25.032721 | finish at 2025-09-10 11:46:16 + [2025-09-10 00:32:56] iteration 4741/ 11920 | consumed samples: 4854784 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918816E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:56.738085 | finish at 2025-09-10 11:44:53 + [2025-09-10 00:33:02] iteration 4742/ 11920 | consumed samples: 4855808 | elapsed time per iteration (ms): 5835.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905962E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:05.851630 | finish at 2025-09-10 12:11:08 + [2025-09-10 00:33:08] iteration 4743/ 11920 | consumed samples: 4856832 | elapsed time per iteration (ms): 5844.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896148E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:05.947876 | finish at 2025-09-10 12:12:14 + [2025-09-10 00:33:14] iteration 4744/ 11920 | consumed samples: 4857856 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886635E+00 | loss scale: 1.0 | grad norm: 
0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:09.733389 | finish at 2025-09-10 11:45:23 + [2025-09-10 00:33:19] iteration 4745/ 11920 | consumed samples: 4858880 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913727E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:26.805120 | finish at 2025-09-10 11:45:46 + [2025-09-10 00:33:25] iteration 4746/ 11920 | consumed samples: 4859904 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901948E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:10.522210 | finish at 2025-09-10 11:46:36 + [2025-09-10 00:33:31] iteration 4747/ 11920 | consumed samples: 4860928 | elapsed time per iteration (ms): 5829.1 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906471E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:52.444789 | finish at 2025-09-10 12:10:23 + [2025-09-10 00:33:36] iteration 4748/ 11920 | consumed samples: 4861952 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906003E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:06.306903 | finish at 2025-09-10 11:46:43 + [2025-09-10 00:33:42] iteration 4749/ 11920 | consumed samples: 4862976 | elapsed time per iteration (ms): 
5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900501E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:50.097883 | finish at 2025-09-10 11:46:32 + [2025-09-10 00:33:48] iteration 4750/ 11920 | consumed samples: 4864000 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900920E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:26.942854 | finish at 2025-09-10 11:46:15 + [2025-09-10 00:33:53] iteration 4751/ 11920 | consumed samples: 4865024 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907243E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:30.469706 | finish at 2025-09-10 11:45:24 + [2025-09-10 00:33:59] iteration 4752/ 11920 | consumed samples: 4866048 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921906E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:02.921631 | finish at 2025-09-10 11:45:02 + [2025-09-10 00:34:05] iteration 4753/ 11920 | consumed samples: 4867072 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903463E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:11:48.013339 | finish at 2025-09-10 11:45:53 + [2025-09-10 00:34:10] iteration 4754/ 11920 | consumed samples: 4868096 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907130E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:21.400211 | finish at 2025-09-10 11:45:32 + [2025-09-10 00:34:16] iteration 4755/ 11920 | consumed samples: 4869120 | elapsed time per iteration (ms): 5860.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905442E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:51.277542 | finish at 2025-09-10 12:14:07 + [2025-09-10 00:34:22] iteration 4756/ 11920 | consumed samples: 4870144 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918072E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:13:06.582341 | finish at 2025-09-10 11:47:28 + [2025-09-10 00:34:27] iteration 4757/ 11920 | consumed samples: 4871168 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908175E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:15.217060 | finish at 2025-09-10 11:46:43 + [2025-09-10 00:34:34] iteration 4758/ 11920 | consumed samples: 4872192 | elapsed time per iteration (ms): 6261.9 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.907545E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:27:27.802561 | finish at 2025-09-10 13:02:01 + [2025-09-10 00:34:39] iteration 4759/ 11920 | consumed samples: 4873216 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919083E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:11.557634 | finish at 2025-09-10 11:46:51 + [2025-09-10 00:34:45] iteration 4760/ 11920 | consumed samples: 4874240 | elapsed time per iteration (ms): 5898.6 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924942E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:53.800898 | finish at 2025-09-10 12:18:39 + [2025-09-10 00:34:51] iteration 4761/ 11920 | consumed samples: 4875264 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915972E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:15.557169 | finish at 2025-09-10 11:46:06 + [2025-09-10 00:34:56] iteration 4762/ 11920 | consumed samples: 4876288 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905956E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:36.947844 | finish at 2025-09-10 11:45:33 + [2025-09-10 00:35:02] iteration 
4763/ 11920 | consumed samples: 4877312 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908595E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:52.581583 | finish at 2025-09-10 12:22:55 + [2025-09-10 00:35:08] iteration 4764/ 11920 | consumed samples: 4878336 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913279E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:26.402377 | finish at 2025-09-10 11:46:34 + [2025-09-10 00:35:14] iteration 4765/ 11920 | consumed samples: 4879360 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897688E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:03.773496 | finish at 2025-09-10 11:46:17 + [2025-09-10 00:35:19] iteration 4766/ 11920 | consumed samples: 4880384 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895719E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:26.366056 | finish at 2025-09-10 11:46:46 + [2025-09-10 00:35:25] iteration 4767/ 11920 | consumed samples: 4881408 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898826E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:48.239571 | finish at 2025-09-10 11:47:13 + [2025-09-10 00:35:30] iteration 4768/ 11920 | consumed samples: 4882432 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921176E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:05.416569 | finish at 2025-09-10 11:45:36 +(min, max) time across ranks (ms): + save-checkpoint ................................: (4715.18, 4715.53) + [2025-09-10 00:35:41] iteration 4769/ 11920 | consumed samples: 4883456 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921682E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:00.899801 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:35:47] iteration 4770/ 11920 | consumed samples: 4884480 | elapsed time per iteration (ms): 5831.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917048E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:57.797763 | finish at 2025-09-10 12:10:44 + [2025-09-10 00:35:52] iteration 4771/ 11920 | consumed samples: 4885504 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917059E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:43.452153 | finish at 2025-09-10 11:45:36 + [2025-09-10 00:35:58] 
iteration 4772/ 11920 | consumed samples: 4886528 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899812E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:28.985051 | finish at 2025-09-10 11:46:27 + [2025-09-10 00:36:04] iteration 4773/ 11920 | consumed samples: 4887552 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917290E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:01.852848 | finish at 2025-09-10 11:46:05 + [2025-09-10 00:36:09] iteration 4774/ 11920 | consumed samples: 4888576 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901598E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:55.731481 | finish at 2025-09-10 11:45:05 + [2025-09-10 00:36:15] iteration 4775/ 11920 | consumed samples: 4889600 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900309E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:37.058320 | finish at 2025-09-10 11:45:52 + [2025-09-10 00:36:20] iteration 4776/ 11920 | consumed samples: 4890624 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910005E+00 | loss scale: 1.0 | grad norm: 0.213 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:31.261486 | finish at 2025-09-10 11:45:52 + [2025-09-10 00:36:26] iteration 4777/ 11920 | consumed samples: 4891648 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900088E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:54.243337 | finish at 2025-09-10 11:47:20 + [2025-09-10 00:36:32] iteration 4778/ 11920 | consumed samples: 4892672 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912072E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:28.384523 | finish at 2025-09-10 11:46:00 + [2025-09-10 00:36:37] iteration 4779/ 11920 | consumed samples: 4893696 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899769E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:40.986041 | finish at 2025-09-10 11:46:18 + [2025-09-10 00:36:43] iteration 4780/ 11920 | consumed samples: 4894720 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919168E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:27.734599 | finish at 2025-09-10 11:46:11 + [2025-09-10 00:36:49] iteration 4781/ 11920 | consumed samples: 4895744 | elapsed time per iteration (ms): 6171.7 | 
throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906580E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:14:19.968852 | finish at 2025-09-10 12:51:09 + [2025-09-10 00:36:55] iteration 4782/ 11920 | consumed samples: 4896768 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902786E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:46.076286 | finish at 2025-09-10 11:46:41 + [2025-09-10 00:37:00] iteration 4783/ 11920 | consumed samples: 4897792 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911518E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:26.098687 | finish at 2025-09-10 11:45:26 + [2025-09-10 00:37:06] iteration 4784/ 11920 | consumed samples: 4898816 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921793E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:23.786659 | finish at 2025-09-10 11:45:30 + [2025-09-10 00:37:12] iteration 4785/ 11920 | consumed samples: 4899840 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905180E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
11:09:14.250846 | finish at 2025-09-10 11:46:26 + [2025-09-10 00:37:18] iteration 4786/ 11920 | consumed samples: 4900864 | elapsed time per iteration (ms): 6026.0 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890413E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:56:29.239237 | finish at 2025-09-10 12:33:47 + [2025-09-10 00:37:23] iteration 4787/ 11920 | consumed samples: 4901888 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918444E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:52.186008 | finish at 2025-09-10 11:46:15 + [2025-09-10 00:37:29] iteration 4788/ 11920 | consumed samples: 4902912 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894000E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:01.949710 | finish at 2025-09-10 11:45:31 + [2025-09-10 00:37:34] iteration 4789/ 11920 | consumed samples: 4903936 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897765E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:49.661656 | finish at 2025-09-10 11:45:24 + [2025-09-10 00:37:40] iteration 4790/ 11920 | consumed samples: 4904960 | elapsed time per iteration (ms): 5974.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 
1024 | lm loss: 2.906349E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:50:01.298840 | finish at 2025-09-10 12:27:42 + [2025-09-10 00:37:46] iteration 4791/ 11920 | consumed samples: 4905984 | elapsed time per iteration (ms): 5964.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920940E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:38.350221 | finish at 2025-09-10 12:26:25 + [2025-09-10 00:37:52] iteration 4792/ 11920 | consumed samples: 4907008 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897227E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:39.071978 | finish at 2025-09-10 11:45:31 + [2025-09-10 00:37:58] iteration 4793/ 11920 | consumed samples: 4908032 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895206E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:42.709310 | finish at 2025-09-10 11:45:40 + [2025-09-10 00:38:03] iteration 4794/ 11920 | consumed samples: 4909056 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911974E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:30.628564 | finish at 2025-09-10 11:45:34 + [2025-09-10 00:38:09] iteration 4795/ 11920 | 
consumed samples: 4910080 | elapsed time per iteration (ms): 5857.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909585E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:30.961025 | finish at 2025-09-10 12:13:40 + [2025-09-10 00:38:15] iteration 4796/ 11920 | consumed samples: 4911104 | elapsed time per iteration (ms): 5885.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909716E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:47.617181 | finish at 2025-09-10 12:17:03 + [2025-09-10 00:38:22] iteration 4797/ 11920 | consumed samples: 4912128 | elapsed time per iteration (ms): 6539.5 | throughput per GPU (TFLOP/s/GPU): 69.0 | MFU 6.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895406E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:56:20.877167 | finish at 2025-09-10 13:34:42 + [2025-09-10 00:38:27] iteration 4798/ 11920 | consumed samples: 4913152 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903033E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:09.099744 | finish at 2025-09-10 11:45:36 + [2025-09-10 00:38:33] iteration 4799/ 11920 | consumed samples: 4914176 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907269E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:51.878336 | finish at 2025-09-10 11:45:25 + [2025-09-10 00:38:38] iteration 4800/ 11920 | consumed samples: 4915200 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899479E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:21.119728 | finish at 2025-09-10 11:47:00 + [2025-09-10 00:38:44] iteration 4801/ 11920 | consumed samples: 4916224 | elapsed time per iteration (ms): 5891.4 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905746E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:00.614222 | finish at 2025-09-10 12:17:45 + [2025-09-10 00:38:50] iteration 4802/ 11920 | consumed samples: 4917248 | elapsed time per iteration (ms): 5844.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911278E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:24.336850 | finish at 2025-09-10 12:12:14 + [2025-09-10 00:38:56] iteration 4803/ 11920 | consumed samples: 4918272 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916560E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:06.569898 | finish at 2025-09-10 11:47:02 + [2025-09-10 00:39:01] iteration 4804/ 11920 | consumed samples: 4919296 | elapsed time per iteration (ms): 5622.8 | throughput per GPU 
(TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930458E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:52.044096 | finish at 2025-09-10 11:45:53 + [2025-09-10 00:39:07] iteration 4805/ 11920 | consumed samples: 4920320 | elapsed time per iteration (ms): 5994.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905999E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:50:52.683733 | finish at 2025-09-10 12:30:00 + [2025-09-10 00:39:13] iteration 4806/ 11920 | consumed samples: 4921344 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893250E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:33.215133 | finish at 2025-09-10 11:45:46 + [2025-09-10 00:39:19] iteration 4807/ 11920 | consumed samples: 4922368 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902768E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:50.458805 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:39:25] iteration 4808/ 11920 | consumed samples: 4923392 | elapsed time per iteration (ms): 5866.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892210E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:22.175951 | 
finish at 2025-09-10 12:14:47 + [2025-09-10 00:39:30] iteration 4809/ 11920 | consumed samples: 4924416 | elapsed time per iteration (ms): 5856.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907982E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:02.819237 | finish at 2025-09-10 12:13:33 + [2025-09-10 00:39:36] iteration 4810/ 11920 | consumed samples: 4925440 | elapsed time per iteration (ms): 5973.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894518E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:48.706720 | finish at 2025-09-10 12:27:25 + [2025-09-10 00:39:42] iteration 4811/ 11920 | consumed samples: 4926464 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894322E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:25.984319 | finish at 2025-09-10 11:46:08 + [2025-09-10 00:39:48] iteration 4812/ 11920 | consumed samples: 4927488 | elapsed time per iteration (ms): 5896.4 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901428E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:31.446637 | finish at 2025-09-10 12:18:19 + [2025-09-10 00:39:54] iteration 4813/ 11920 | consumed samples: 4928512 | elapsed time per iteration (ms): 6151.7 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.893195E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:08:39.879569 | finish at 2025-09-10 12:48:34 + [2025-09-10 00:40:00] iteration 4814/ 11920 | consumed samples: 4929536 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893937E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:07.160195 | finish at 2025-09-10 11:46:07 + [2025-09-10 00:40:05] iteration 4815/ 11920 | consumed samples: 4930560 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902932E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:51.349965 | finish at 2025-09-10 11:45:57 + [2025-09-10 00:40:11] iteration 4816/ 11920 | consumed samples: 4931584 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911770E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:24.272781 | finish at 2025-09-10 11:46:35 + [2025-09-10 00:40:17] iteration 4817/ 11920 | consumed samples: 4932608 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912364E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:51.666864 | finish at 2025-09-10 11:45:08 + [2025-09-10 00:40:22] iteration 4818/ 11920 | consumed samples: 
4933632 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903022E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:47.273487 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:40:28] iteration 4819/ 11920 | consumed samples: 4934656 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918955E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:41.349032 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:40:33] iteration 4820/ 11920 | consumed samples: 4935680 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905855E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:33.743739 | finish at 2025-09-10 11:46:07 + [2025-09-10 00:40:39] iteration 4821/ 11920 | consumed samples: 4936704 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902965E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:17.166884 | finish at 2025-09-10 11:45:56 + [2025-09-10 00:40:45] iteration 4822/ 11920 | consumed samples: 4937728 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897070E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:20.631580 | finish at 2025-09-10 11:46:05 + [2025-09-10 00:40:50] iteration 4823/ 11920 | consumed samples: 4938752 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911532E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:44.750015 | finish at 2025-09-10 11:45:35 + [2025-09-10 00:40:56] iteration 4824/ 11920 | consumed samples: 4939776 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900640E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:25.179670 | finish at 2025-09-10 11:46:21 + [2025-09-10 00:41:02] iteration 4825/ 11920 | consumed samples: 4940800 | elapsed time per iteration (ms): 5841.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905557E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:45.312560 | finish at 2025-09-10 12:11:47 + [2025-09-10 00:41:07] iteration 4826/ 11920 | consumed samples: 4941824 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892854E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:30.432266 | finish at 2025-09-10 11:45:38 + [2025-09-10 00:41:13] iteration 4827/ 11920 | consumed samples: 4942848 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901289E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:16.288803 | finish at 2025-09-10 11:45:29 + [2025-09-10 00:41:19] iteration 4828/ 11920 | consumed samples: 4943872 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904008E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:06.638680 | finish at 2025-09-10 11:45:25 + [2025-09-10 00:41:24] iteration 4829/ 11920 | consumed samples: 4944896 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906286E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:23.251881 | finish at 2025-09-10 11:45:47 + [2025-09-10 00:41:30] iteration 4830/ 11920 | consumed samples: 4945920 | elapsed time per iteration (ms): 6238.0 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909686E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:17:07.339957 | finish at 2025-09-10 12:58:38 + [2025-09-10 00:41:36] iteration 4831/ 11920 | consumed samples: 4946944 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911930E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:00.311020 | finish at 2025-09-10 
11:45:36 + [2025-09-10 00:41:42] iteration 4832/ 11920 | consumed samples: 4947968 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911510E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:26.162212 | finish at 2025-09-10 11:46:08 + [2025-09-10 00:41:47] iteration 4833/ 11920 | consumed samples: 4948992 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904521E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:35.743116 | finish at 2025-09-10 11:46:23 + [2025-09-10 00:41:53] iteration 4834/ 11920 | consumed samples: 4950016 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910996E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:40.624792 | finish at 2025-09-10 11:46:34 + [2025-09-10 00:41:59] iteration 4835/ 11920 | consumed samples: 4951040 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900413E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:19.658809 | finish at 2025-09-10 11:46:18 + [2025-09-10 00:42:04] iteration 4836/ 11920 | consumed samples: 4952064 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903985E+00 | loss 
scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:16.733529 | finish at 2025-09-10 11:46:21 + [2025-09-10 00:42:10] iteration 4837/ 11920 | consumed samples: 4953088 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893445E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:40.489065 | finish at 2025-09-10 11:45:50 + [2025-09-10 00:42:15] iteration 4838/ 11920 | consumed samples: 4954112 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891981E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:12.359641 | finish at 2025-09-10 11:45:28 + [2025-09-10 00:42:21] iteration 4839/ 11920 | consumed samples: 4955136 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889225E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:02.900207 | finish at 2025-09-10 11:46:24 + [2025-09-10 00:42:27] iteration 4840/ 11920 | consumed samples: 4956160 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899531E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:14.119749 | finish at 2025-09-10 11:46:41 + [2025-09-10 00:42:32] iteration 4841/ 11920 | consumed samples: 4957184 | elapsed time 
per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910942E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:36.971622 | finish at 2025-09-10 11:46:09 + [2025-09-10 00:42:38] iteration 4842/ 11920 | consumed samples: 4958208 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895432E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:57.856319 | finish at 2025-09-10 11:46:36 + [2025-09-10 00:42:44] iteration 4843/ 11920 | consumed samples: 4959232 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904816E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:36.164929 | finish at 2025-09-10 11:46:20 + [2025-09-10 00:42:49] iteration 4844/ 11920 | consumed samples: 4960256 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887018E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:21.347745 | finish at 2025-09-10 11:46:11 + [2025-09-10 00:42:55] iteration 4845/ 11920 | consumed samples: 4961280 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897416E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:03:08.204789 | finish at 2025-09-10 11:46:03 + [2025-09-10 00:43:01] iteration 4846/ 11920 | consumed samples: 4962304 | elapsed time per iteration (ms): 5894.4 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901096E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:34:57.047164 | finish at 2025-09-10 12:17:58 + [2025-09-10 00:43:06] iteration 4847/ 11920 | consumed samples: 4963328 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878511E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:42.918504 | finish at 2025-09-10 11:45:49 + [2025-09-10 00:43:12] iteration 4848/ 11920 | consumed samples: 4964352 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900105E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:48.521324 | finish at 2025-09-10 11:45:00 + [2025-09-10 00:43:18] iteration 4849/ 11920 | consumed samples: 4965376 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890938E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:17.653646 | finish at 2025-09-10 11:45:35 + [2025-09-10 00:43:23] iteration 4850/ 11920 | consumed samples: 4966400 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.892549E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:04.527776 | finish at 2025-09-10 11:45:28 + [2025-09-10 00:43:29] iteration 4851/ 11920 | consumed samples: 4967424 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890278E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:22.271782 | finish at 2025-09-10 11:45:51 + [2025-09-10 00:43:34] iteration 4852/ 11920 | consumed samples: 4968448 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905226E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:25.235533 | finish at 2025-09-10 11:46:00 + [2025-09-10 00:43:40] iteration 4853/ 11920 | consumed samples: 4969472 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905947E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:35.841264 | finish at 2025-09-10 11:46:16 + [2025-09-10 00:43:46] iteration 4854/ 11920 | consumed samples: 4970496 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922548E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:29.560368 | finish at 2025-09-10 11:46:15 + [2025-09-10 
00:43:51] iteration 4855/ 11920 | consumed samples: 4971520 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900065E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:04.254051 | finish at 2025-09-10 11:45:56 + [2025-09-10 00:43:57] iteration 4856/ 11920 | consumed samples: 4972544 | elapsed time per iteration (ms): 5968.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904291E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:41.619705 | finish at 2025-09-10 12:26:39 + [2025-09-10 00:44:03] iteration 4857/ 11920 | consumed samples: 4973568 | elapsed time per iteration (ms): 5964.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906903E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:42:08.492434 | finish at 2025-09-10 12:26:12 + [2025-09-10 00:44:09] iteration 4858/ 11920 | consumed samples: 4974592 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896272E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:24.488281 | finish at 2025-09-10 11:46:33 + [2025-09-10 00:44:14] iteration 4859/ 11920 | consumed samples: 4975616 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917911E+00 | loss scale: 1.0 | grad norm: 
0.240 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:05.963253 | finish at 2025-09-10 11:46:20 + [2025-09-10 00:44:20] iteration 4860/ 11920 | consumed samples: 4976640 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902379E+00 | loss scale: 1.0 | grad norm: 0.720 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:57.638917 | finish at 2025-09-10 11:46:18 + [2025-09-10 00:44:26] iteration 4861/ 11920 | consumed samples: 4977664 | elapsed time per iteration (ms): 5660.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915829E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:59.221850 | finish at 2025-09-10 11:50:25 + [2025-09-10 00:44:31] iteration 4862/ 11920 | consumed samples: 4978688 | elapsed time per iteration (ms): 5716.8 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.890004E+00 | loss scale: 1.0 | grad norm: 43.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:29.354275 | finish at 2025-09-10 11:57:01 + [2025-09-10 00:44:37] iteration 4863/ 11920 | consumed samples: 4979712 | elapsed time per iteration (ms): 5717.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.284352E+00 | loss scale: 1.0 | grad norm: 33.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:25.678346 | finish at 2025-09-10 11:57:03 + [2025-09-10 00:44:43] iteration 4864/ 11920 | consumed samples: 4980736 | elapsed time per iteration (ms): 
5888.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.833797E+00 | loss scale: 1.0 | grad norm: 14.738 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:28.182529 | finish at 2025-09-10 12:17:11 + [2025-09-10 00:44:49] iteration 4865/ 11920 | consumed samples: 4981760 | elapsed time per iteration (ms): 6049.7 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 9.020638E+00 | loss scale: 1.0 | grad norm: 98.742 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:51:20.884278 | finish at 2025-09-10 12:36:10 + [2025-09-10 00:44:55] iteration 4866/ 11920 | consumed samples: 4982784 | elapsed time per iteration (ms): 5744.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 1.059323E+01 | loss scale: 1.0 | grad norm: 101.845 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:17.861207 | finish at 2025-09-10 12:00:13 + [2025-09-10 00:45:01] iteration 4867/ 11920 | consumed samples: 4983808 | elapsed time per iteration (ms): 5913.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 1.409564E+01 | loss scale: 1.0 | grad norm: 20.842 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:09.858207 | finish at 2025-09-10 12:20:11 + [2025-09-10 00:45:06] iteration 4868/ 11920 | consumed samples: 4984832 | elapsed time per iteration (ms): 5687.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 1.053059E+01 | loss scale: 1.0 | grad norm: 10.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:08:31.321786 | finish at 2025-09-10 11:53:38 + [2025-09-10 00:45:12] iteration 4869/ 11920 | consumed samples: 4985856 | elapsed time per iteration (ms): 5882.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 1.022915E+01 | loss scale: 1.0 | grad norm: 13.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:16.526954 | finish at 2025-09-10 12:16:29 + [2025-09-10 00:45:18] iteration 4870/ 11920 | consumed samples: 4986880 | elapsed time per iteration (ms): 5506.0 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 1.003836E+01 | loss scale: 1.0 | grad norm: 6.577 | num zeros: 16523811.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:57.238104 | finish at 2025-09-10 11:32:15 + [2025-09-10 00:45:24] iteration 4871/ 11920 | consumed samples: 4987904 | elapsed time per iteration (ms): 5767.3 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 8.928322E+00 | loss scale: 1.0 | grad norm: 4.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:34.019606 | finish at 2025-09-10 12:02:58 + [2025-09-10 00:45:29] iteration 4872/ 11920 | consumed samples: 4988928 | elapsed time per iteration (ms): 5600.9 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 8.855474E+00 | loss scale: 1.0 | grad norm: 6.357 | num zeros: 7080963.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:55.149462 | finish at 2025-09-10 11:43:24 + [2025-09-10 00:45:35] iteration 4873/ 11920 | consumed samples: 4989952 | elapsed time per iteration (ms): 5602.3 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 8.788349E+00 | loss scale: 1.0 | grad norm: 3.715 | num zeros: 2360325.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:59.536968 | finish at 2025-09-10 11:43:34 + [2025-09-10 00:45:40] iteration 4874/ 11920 | consumed samples: 4990976 | elapsed time per iteration (ms): 5589.8 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 8.197751E+00 | loss scale: 1.0 | grad norm: 2.690 | num zeros: 21244418.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:25.753480 | finish at 2025-09-10 11:42:06 + [2025-09-10 00:45:46] iteration 4875/ 11920 | consumed samples: 4992000 | elapsed time per iteration (ms): 5605.6 | throughput per GPU (TFLOP/s/GPU): 80.5 | MFU 8.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 8.030297E+00 | loss scale: 1.0 | grad norm: 3.292 | num zeros: 21244428.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:58:11.652715 | finish at 2025-09-10 11:43:58 + [2025-09-10 00:45:52] iteration 4876/ 11920 | consumed samples: 4993024 | elapsed time per iteration (ms): 5578.0 | throughput per GPU (TFLOP/s/GPU): 80.9 | MFU 8.18% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.877439E+00 | loss scale: 1.0 | grad norm: 2.056 | num zeros: 16524608.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:51.726382 | finish at 2025-09-10 11:40:43 + [2025-09-10 00:45:57] iteration 4877/ 11920 | consumed samples: 4994048 | elapsed time per iteration (ms): 5494.6 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 8.189173E+00 | loss scale: 1.0 | grad norm: 5.702 | num zeros: 25964344.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:58.478281 | finish at 
2025-09-10 11:30:56 + [2025-09-10 00:46:03] iteration 4878/ 11920 | consumed samples: 4995072 | elapsed time per iteration (ms): 5806.4 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.756600E+00 | loss scale: 1.0 | grad norm: 2.410 | num zeros: 51930152.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:28.363659 | finish at 2025-09-10 12:07:31 + [2025-09-10 00:46:08] iteration 4879/ 11920 | consumed samples: 4996096 | elapsed time per iteration (ms): 5454.8 | throughput per GPU (TFLOP/s/GPU): 82.8 | MFU 8.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.460886E+00 | loss scale: 1.0 | grad norm: 1.508 | num zeros: 59010368.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:06.995902 | finish at 2025-09-10 11:26:15 + [2025-09-10 00:46:14] iteration 4880/ 11920 | consumed samples: 4997120 | elapsed time per iteration (ms): 5494.2 | throughput per GPU (TFLOP/s/GPU): 82.2 | MFU 8.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.106377E+00 | loss scale: 1.0 | grad norm: 1.432 | num zeros: 56647724.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:39.053802 | finish at 2025-09-10 11:30:53 + [2025-09-10 00:46:19] iteration 4881/ 11920 | consumed samples: 4998144 | elapsed time per iteration (ms): 5508.9 | throughput per GPU (TFLOP/s/GPU): 82.0 | MFU 8.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.088967E+00 | loss scale: 1.0 | grad norm: 2.385 | num zeros: 49569072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:17.020720 | finish at 2025-09-10 11:32:36 + [2025-09-10 00:46:25] iteration 4882/ 11920 | consumed samples: 4999168 | elapsed time per iteration (ms): 5593.4 | throughput per GPU (TFLOP/s/GPU): 80.7 | MFU 8.16% | learning rate: 2.000000E-03 | global batch size: 
1024 | lm loss: 7.024648E+00 | loss scale: 1.0 | grad norm: 1.454 | num zeros: 49568296.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:06.379415 | finish at 2025-09-10 11:42:31 + [2025-09-10 00:46:31] iteration 4883/ 11920 | consumed samples: 5000192 | elapsed time per iteration (ms): 5544.1 | throughput per GPU (TFLOP/s/GPU): 81.4 | MFU 8.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.860208E+00 | loss scale: 1.0 | grad norm: 1.111 | num zeros: 63730992.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:14.122276 | finish at 2025-09-10 11:36:45 + [2025-09-10 00:46:36] iteration 4884/ 11920 | consumed samples: 5001216 | elapsed time per iteration (ms): 5531.8 | throughput per GPU (TFLOP/s/GPU): 81.6 | MFU 8.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.732201E+00 | loss scale: 1.0 | grad norm: 0.978 | num zeros: 70811264.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:42.008392 | finish at 2025-09-10 11:35:18 + [2025-09-10 00:46:42] iteration 4885/ 11920 | consumed samples: 5002240 | elapsed time per iteration (ms): 5546.2 | throughput per GPU (TFLOP/s/GPU): 81.4 | MFU 8.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.610904E+00 | loss scale: 1.0 | grad norm: 0.885 | num zeros: 66091320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:17.752079 | finish at 2025-09-10 11:36:59 + [2025-09-10 00:46:47] iteration 4886/ 11920 | consumed samples: 5003264 | elapsed time per iteration (ms): 5549.5 | throughput per GPU (TFLOP/s/GPU): 81.4 | MFU 8.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.571599E+00 | loss scale: 1.0 | grad norm: 1.296 | num zeros: 63730324.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:35.226522 | finish at 2025-09-10 11:37:22 + [2025-09-10 
00:46:53] iteration 4887/ 11920 | consumed samples: 5004288 | elapsed time per iteration (ms): 5556.1 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.496019E+00 | loss scale: 1.0 | grad norm: 0.768 | num zeros: 63730312.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:15.767159 | finish at 2025-09-10 11:38:08 + [2025-09-10 00:46:58] iteration 4888/ 11920 | consumed samples: 5005312 | elapsed time per iteration (ms): 5771.5 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.416054E+00 | loss scale: 1.0 | grad norm: 0.722 | num zeros: 54290472.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:16:25.036211 | finish at 2025-09-10 12:03:24 + [2025-09-10 00:47:04] iteration 4889/ 11920 | consumed samples: 5006336 | elapsed time per iteration (ms): 5530.2 | throughput per GPU (TFLOP/s/GPU): 81.6 | MFU 8.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.393728E+00 | loss scale: 1.0 | grad norm: 0.598 | num zeros: 54291300.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:03.049125 | finish at 2025-09-10 11:35:07 + [2025-09-10 00:47:10] iteration 4890/ 11920 | consumed samples: 5007360 | elapsed time per iteration (ms): 5529.5 | throughput per GPU (TFLOP/s/GPU): 81.7 | MFU 8.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.337256E+00 | loss scale: 1.0 | grad norm: 0.645 | num zeros: 40135224.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:47:52.628086 | finish at 2025-09-10 11:35:02 + [2025-09-10 00:47:15] iteration 4891/ 11920 | consumed samples: 5008384 | elapsed time per iteration (ms): 5529.0 | throughput per GPU (TFLOP/s/GPU): 81.7 | MFU 8.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.310758E+00 | 
loss scale: 1.0 | grad norm: 0.537 | num zeros: 54291244.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:47:43.445211 | finish at 2025-09-10 11:34:59 + [2025-09-10 00:47:21] iteration 4892/ 11920 | consumed samples: 5009408 | elapsed time per iteration (ms): 5527.4 | throughput per GPU (TFLOP/s/GPU): 81.7 | MFU 8.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.278259E+00 | loss scale: 1.0 | grad norm: 0.786 | num zeros: 51933252.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:47:26.636018 | finish at 2025-09-10 11:34:47 + [2025-09-10 00:47:26] iteration 4893/ 11920 | consumed samples: 5010432 | elapsed time per iteration (ms): 5554.6 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.236415E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 56649288.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:31.842480 | finish at 2025-09-10 11:37:58 + [2025-09-10 00:47:32] iteration 4894/ 11920 | consumed samples: 5011456 | elapsed time per iteration (ms): 5555.3 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.183503E+00 | loss scale: 1.0 | grad norm: 0.536 | num zeros: 51927832.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:31.400421 | finish at 2025-09-10 11:38:03 + [2025-09-10 00:47:37] iteration 4895/ 11920 | consumed samples: 5012480 | elapsed time per iteration (ms): 5541.3 | throughput per GPU (TFLOP/s/GPU): 81.5 | MFU 8.24% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.187113E+00 | loss scale: 1.0 | grad norm: 0.646 | num zeros: 42487416.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:47.505618 | finish at 2025-09-10 11:36:25 + [2025-09-10 00:47:43] iteration 4896/ 11920 | 
consumed samples: 5013504 | elapsed time per iteration (ms): 5550.8 | throughput per GPU (TFLOP/s/GPU): 81.3 | MFU 8.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.149833E+00 | loss scale: 1.0 | grad norm: 0.378 | num zeros: 42488072.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:49:48.526733 | finish at 2025-09-10 11:37:31 + [2025-09-10 00:47:49] iteration 4897/ 11920 | consumed samples: 5014528 | elapsed time per iteration (ms): 5921.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.125192E+00 | loss scale: 1.0 | grad norm: 0.491 | num zeros: 37767472.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:07.362363 | finish at 2025-09-10 12:20:56 + [2025-09-10 00:47:54] iteration 4898/ 11920 | consumed samples: 5015552 | elapsed time per iteration (ms): 5563.4 | throughput per GPU (TFLOP/s/GPU): 81.2 | MFU 8.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.112451E+00 | loss scale: 1.0 | grad norm: 0.438 | num zeros: 30684174.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:06.302159 | finish at 2025-09-10 11:39:01 + [2025-09-10 00:48:00] iteration 4899/ 11920 | consumed samples: 5016576 | elapsed time per iteration (ms): 5585.2 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.062888E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 14161945.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:33.544068 | finish at 2025-09-10 11:41:33 + [2025-09-10 00:48:05] iteration 4900/ 11920 | consumed samples: 5017600 | elapsed time per iteration (ms): 5565.2 | throughput per GPU (TFLOP/s/GPU): 81.1 | MFU 8.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.050630E+00 | loss scale: 1.0 | grad norm: 0.278 | 
num zeros: 14163563.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:07.739782 | finish at 2025-09-10 11:39:13 + [2025-09-10 00:48:11] iteration 4901/ 11920 | consumed samples: 5018624 | elapsed time per iteration (ms): 5559.9 | throughput per GPU (TFLOP/s/GPU): 81.2 | MFU 8.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.023784E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 18882588.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:25.038826 | finish at 2025-09-10 11:38:36 + [2025-09-10 00:48:17] iteration 4902/ 11920 | consumed samples: 5019648 | elapsed time per iteration (ms): 5588.6 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.028959E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 9441339.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:40.554970 | finish at 2025-09-10 11:41:57 + [2025-09-10 00:48:22] iteration 4903/ 11920 | consumed samples: 5020672 | elapsed time per iteration (ms): 5590.2 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.005325E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 7085593.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:46.337671 | finish at 2025-09-10 11:42:09 + [2025-09-10 00:48:28] iteration 4904/ 11920 | consumed samples: 5021696 | elapsed time per iteration (ms): 5577.1 | throughput per GPU (TFLOP/s/GPU): 81.0 | MFU 8.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.969256E+00 | loss scale: 1.0 | grad norm: 0.391 | num zeros: 9441285.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:08.932198 | finish at 2025-09-10 11:40:37 + [2025-09-10 00:48:33] iteration 4905/ 11920 | consumed samples: 5022720 | elapsed time 
per iteration (ms): 5589.2 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.968687E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 9442822.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:28.495705 | finish at 2025-09-10 11:42:02 + [2025-09-10 00:48:39] iteration 4906/ 11920 | consumed samples: 5023744 | elapsed time per iteration (ms): 5590.8 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.938702E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 7082503.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:33.705975 | finish at 2025-09-10 11:42:13 + [2025-09-10 00:48:45] iteration 4907/ 11920 | consumed samples: 5024768 | elapsed time per iteration (ms): 5588.0 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.937603E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 2361111.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:08.666151 | finish at 2025-09-10 11:41:53 + [2025-09-10 00:48:50] iteration 4908/ 11920 | consumed samples: 5025792 | elapsed time per iteration (ms): 5600.8 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.934845E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2360356.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:32.948621 | finish at 2025-09-10 11:43:23 + [2025-09-10 00:48:56] iteration 4909/ 11920 | consumed samples: 5026816 | elapsed time per iteration (ms): 5601.3 | throughput per GPU (TFLOP/s/GPU): 80.6 | MFU 8.15% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.903145E+00 | loss scale: 1.0 | grad norm: 0.335 | num zeros: 2360321.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:30.578912 | finish at 2025-09-10 11:43:26 + [2025-09-10 00:49:01] iteration 4910/ 11920 | consumed samples: 5027840 | elapsed time per iteration (ms): 5587.1 | throughput per GPU (TFLOP/s/GPU): 80.8 | MFU 8.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.897310E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 13.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:45.596273 | finish at 2025-09-10 11:41:47 + [2025-09-10 00:49:07] iteration 4911/ 11920 | consumed samples: 5028864 | elapsed time per iteration (ms): 5606.9 | throughput per GPU (TFLOP/s/GPU): 80.5 | MFU 8.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.906716E+00 | loss scale: 1.0 | grad norm: 0.437 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:58.553052 | finish at 2025-09-10 11:44:05 + [2025-09-10 00:49:13] iteration 4912/ 11920 | consumed samples: 5029888 | elapsed time per iteration (ms): 5981.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.894096E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:39.360558 | finish at 2025-09-10 12:27:52 + [2025-09-10 00:49:19] iteration 4913/ 11920 | consumed samples: 5030912 | elapsed time per iteration (ms): 6113.5 | throughput per GPU (TFLOP/s/GPU): 73.9 | MFU 7.47% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.862368E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 2360323.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:53:57.457377 | finish at 2025-09-10 12:43:16 + [2025-09-10 00:49:25] iteration 4914/ 11920 | consumed samples: 5031936 | elapsed time per iteration (ms): 5851.0 | throughput per GPU 
(TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.835306E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2360324.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:11.859281 | finish at 2025-09-10 12:12:37 + [2025-09-10 00:49:30] iteration 4915/ 11920 | consumed samples: 5032960 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.832539E+00 | loss scale: 1.0 | grad norm: 0.314 | num zeros: 2360323.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:11.628166 | finish at 2025-09-10 11:45:42 + [2025-09-10 00:49:36] iteration 4916/ 11920 | consumed samples: 5033984 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.807082E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2360351.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:42.928792 | finish at 2025-09-10 11:46:19 + [2025-09-10 00:49:42] iteration 4917/ 11920 | consumed samples: 5035008 | elapsed time per iteration (ms): 5885.7 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.803364E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2360322.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:57.865896 | finish at 2025-09-10 12:16:40 + [2025-09-10 00:49:48] iteration 4918/ 11920 | consumed samples: 5036032 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.769883E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2361858.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:58:42.757401 | finish at 2025-09-10 11:48:30 + [2025-09-10 00:49:53] iteration 4919/ 11920 | consumed samples: 5037056 | elapsed time per iteration (ms): 5652.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.763734E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4720642.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:34.343685 | finish at 2025-09-10 11:49:28 + [2025-09-10 00:49:59] iteration 4920/ 11920 | consumed samples: 5038080 | elapsed time per iteration (ms): 5662.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.750562E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2360401.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:36.783361 | finish at 2025-09-10 11:50:36 + [2025-09-10 00:50:05] iteration 4921/ 11920 | consumed samples: 5039104 | elapsed time per iteration (ms): 6026.7 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.732427E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4720640.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:43:01.173803 | finish at 2025-09-10 12:33:06 + [2025-09-10 00:50:11] iteration 4922/ 11920 | consumed samples: 5040128 | elapsed time per iteration (ms): 5671.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.742570E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:28.911510 | finish at 2025-09-10 11:51:40 + [2025-09-10 00:50:16] iteration 4923/ 11920 | consumed samples: 5041152 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 5.772856E+00 | loss scale: 1.0 | grad norm: 0.427 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:46.461524 | finish at 2025-09-10 11:47:03 + [2025-09-10 00:50:22] iteration 4924/ 11920 | consumed samples: 5042176 | elapsed time per iteration (ms): 5832.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.722008E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 2360322.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:06.093043 | finish at 2025-09-10 12:10:28 + [2025-09-10 00:50:28] iteration 4925/ 11920 | consumed samples: 5043200 | elapsed time per iteration (ms): 5681.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.734401E+00 | loss scale: 1.0 | grad norm: 0.432 | num zeros: 22.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:23.578161 | finish at 2025-09-10 11:52:51 + [2025-09-10 00:50:33] iteration 4926/ 11920 | consumed samples: 5044224 | elapsed time per iteration (ms): 5660.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.699329E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:49.297227 | finish at 2025-09-10 11:50:23 + [2025-09-10 00:50:39] iteration 4927/ 11920 | consumed samples: 5045248 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.684956E+00 | loss scale: 1.0 | grad norm: 0.334 | num zeros: 2364934.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:44.044395 | finish at 2025-09-10 
11:47:23 + [2025-09-10 00:50:45] iteration 4928/ 11920 | consumed samples: 5046272 | elapsed time per iteration (ms): 5663.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.666601E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 33.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:59.285847 | finish at 2025-09-10 11:50:44 + [2025-09-10 00:50:51] iteration 4929/ 11920 | consumed samples: 5047296 | elapsed time per iteration (ms): 5885.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.633783E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 37.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:25:43.265020 | finish at 2025-09-10 12:16:34 + [2025-09-10 00:50:57] iteration 4930/ 11920 | consumed samples: 5048320 | elapsed time per iteration (ms): 5972.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.645439E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:45.994062 | finish at 2025-09-10 12:26:43 + [2025-09-10 00:51:02] iteration 4931/ 11920 | consumed samples: 5049344 | elapsed time per iteration (ms): 5670.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.630876E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 2360320.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:27.597207 | finish at 2025-09-10 11:51:30 + [2025-09-10 00:51:08] iteration 4932/ 11920 | consumed samples: 5050368 | elapsed time per iteration (ms): 5651.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.623322E+00 | 
loss scale: 1.0 | grad norm: 0.255 | num zeros: 2307.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:58:15.084378 | finish at 2025-09-10 11:49:23 + [2025-09-10 00:51:14] iteration 4933/ 11920 | consumed samples: 5051392 | elapsed time per iteration (ms): 5911.3 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.586366E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2308.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:21.976180 | finish at 2025-09-10 12:19:36 + [2025-09-10 00:51:20] iteration 4934/ 11920 | consumed samples: 5052416 | elapsed time per iteration (ms): 5680.6 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.595592E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:24.666100 | finish at 2025-09-10 11:52:44 + [2025-09-10 00:51:25] iteration 4935/ 11920 | consumed samples: 5053440 | elapsed time per iteration (ms): 5668.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.555960E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:54.337233 | finish at 2025-09-10 11:51:20 + [2025-09-10 00:51:31] iteration 4936/ 11920 | consumed samples: 5054464 | elapsed time per iteration (ms): 5700.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.552398E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 2360321.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:29.761105 | finish at 2025-09-10 11:55:01 + [2025-09-10 00:51:37] iteration 4937/ 11920 | consumed samples: 5055488 
| elapsed time per iteration (ms): 5693.1 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.519274E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:34.928784 | finish at 2025-09-10 11:54:12 + [2025-09-10 00:51:42] iteration 4938/ 11920 | consumed samples: 5056512 | elapsed time per iteration (ms): 5692.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.521693E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:24.015376 | finish at 2025-09-10 11:54:06 + [2025-09-10 00:51:48] iteration 4939/ 11920 | consumed samples: 5057536 | elapsed time per iteration (ms): 5693.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.501685E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 1537.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:24.651071 | finish at 2025-09-10 11:54:13 + [2025-09-10 00:51:54] iteration 4940/ 11920 | consumed samples: 5058560 | elapsed time per iteration (ms): 6322.6 | throughput per GPU (TFLOP/s/GPU): 71.4 | MFU 7.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.521255E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:15:31.569152 | finish at 2025-09-10 13:07:26 + [2025-09-10 00:52:00] iteration 4941/ 11920 | consumed samples: 5059584 | elapsed time per iteration (ms): 5981.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.514574E+00 | loss scale: 1.0 | grad norm: 0.474 | num zeros: 1.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | remaining time: 11:35:45.425322 | finish at 2025-09-10 12:27:46 + [2025-09-10 00:52:06] iteration 4942/ 11920 | consumed samples: 5060608 | elapsed time per iteration (ms): 6018.6 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.527566E+00 | loss scale: 1.0 | grad norm: 0.591 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:57.602713 | finish at 2025-09-10 12:32:04 + [2025-09-10 00:52:12] iteration 4943/ 11920 | consumed samples: 5061632 | elapsed time per iteration (ms): 5681.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.509626E+00 | loss scale: 1.0 | grad norm: 0.413 | num zeros: 33.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:38.221645 | finish at 2025-09-10 11:52:50 + [2025-09-10 00:52:18] iteration 4944/ 11920 | consumed samples: 5062656 | elapsed time per iteration (ms): 5699.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.495604E+00 | loss scale: 1.0 | grad norm: 0.342 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:39.942108 | finish at 2025-09-10 11:54:58 + [2025-09-10 00:52:23] iteration 4945/ 11920 | consumed samples: 5063680 | elapsed time per iteration (ms): 5722.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.492810E+00 | loss scale: 1.0 | grad norm: 0.490 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:13.056171 | finish at 2025-09-10 11:57:36 + [2025-09-10 00:52:29] iteration 4946/ 11920 | consumed samples: 5064704 | elapsed time per iteration (ms): 5706.4 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | 
learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.464215E+00 | loss scale: 1.0 | grad norm: 0.374 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:16.411410 | finish at 2025-09-10 11:55:46 + [2025-09-10 00:52:35] iteration 4947/ 11920 | consumed samples: 5065728 | elapsed time per iteration (ms): 5712.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.449410E+00 | loss scale: 1.0 | grad norm: 0.489 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:53.238228 | finish at 2025-09-10 11:56:28 + [2025-09-10 00:52:41] iteration 4948/ 11920 | consumed samples: 5066752 | elapsed time per iteration (ms): 5718.0 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.415856E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 21.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:26.003594 | finish at 2025-09-10 11:57:07 + [2025-09-10 00:52:46] iteration 4949/ 11920 | consumed samples: 5067776 | elapsed time per iteration (ms): 5728.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.424025E+00 | loss scale: 1.0 | grad norm: 0.409 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:33.076890 | finish at 2025-09-10 11:58:19 + [2025-09-10 00:52:52] iteration 4950/ 11920 | consumed samples: 5068800 | elapsed time per iteration (ms): 5714.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.411506E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:47.164853 | finish at 2025-09-10 11:56:39 + 
[2025-09-10 00:52:58] iteration 4951/ 11920 | consumed samples: 5069824 | elapsed time per iteration (ms): 6067.5 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.384172E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:44.242144 | finish at 2025-09-10 12:37:42 + [2025-09-10 00:53:04] iteration 4952/ 11920 | consumed samples: 5070848 | elapsed time per iteration (ms): 6075.7 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.356054E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:45:35.246990 | finish at 2025-09-10 12:38:39 + [2025-09-10 00:53:10] iteration 4953/ 11920 | consumed samples: 5071872 | elapsed time per iteration (ms): 5718.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.347661E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:03.542837 | finish at 2025-09-10 11:57:13 + [2025-09-10 00:53:16] iteration 4954/ 11920 | consumed samples: 5072896 | elapsed time per iteration (ms): 5722.0 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.331192E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:19.785015 | finish at 2025-09-10 11:57:35 + [2025-09-10 00:53:21] iteration 4955/ 11920 | consumed samples: 5073920 | elapsed time per iteration (ms): 5739.4 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.326647E+00 | loss scale: 1.0 | 
grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:14.737709 | finish at 2025-09-10 11:59:36 + [2025-09-10 00:53:27] iteration 4956/ 11920 | consumed samples: 5074944 | elapsed time per iteration (ms): 6102.6 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.314738E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:48:18.646549 | finish at 2025-09-10 12:41:46 + [2025-09-10 00:53:33] iteration 4957/ 11920 | consumed samples: 5075968 | elapsed time per iteration (ms): 5747.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.291748E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:59.568184 | finish at 2025-09-10 12:00:33 + [2025-09-10 00:53:39] iteration 4958/ 11920 | consumed samples: 5076992 | elapsed time per iteration (ms): 6227.4 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.281983E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:02:35.013180 | finish at 2025-09-10 12:56:14 + [2025-09-10 00:53:45] iteration 4959/ 11920 | consumed samples: 5078016 | elapsed time per iteration (ms): 6023.1 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.280992E+00 | loss scale: 1.0 | grad norm: 0.328 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:47.103771 | finish at 2025-09-10 12:32:33 + [2025-09-10 00:53:51] iteration 4960/ 11920 | consumed samples: 5079040 | elapsed time per iteration 
(ms): 5725.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.314358E+00 | loss scale: 1.0 | grad norm: 0.524 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:12.379704 | finish at 2025-09-10 11:58:04 + [2025-09-10 00:53:57] iteration 4961/ 11920 | consumed samples: 5080064 | elapsed time per iteration (ms): 6008.5 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.272666E+00 | loss scale: 1.0 | grad norm: 0.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:36:53.392768 | finish at 2025-09-10 12:30:51 + [2025-09-10 00:54:03] iteration 4962/ 11920 | consumed samples: 5081088 | elapsed time per iteration (ms): 5758.2 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.257274E+00 | loss scale: 1.0 | grad norm: 0.308 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:45.614835 | finish at 2025-09-10 12:01:49 + [2025-09-10 00:54:09] iteration 4963/ 11920 | consumed samples: 5082112 | elapsed time per iteration (ms): 6407.7 | throughput per GPU (TFLOP/s/GPU): 70.5 | MFU 7.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.242294E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:22:58.583299 | finish at 2025-09-10 13:17:08 + [2025-09-10 00:54:15] iteration 4964/ 11920 | consumed samples: 5083136 | elapsed time per iteration (ms): 5747.4 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.218726E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 11:06:18.911399 | finish at 2025-09-10 12:00:34 + [2025-09-10 00:54:21] iteration 4965/ 11920 | consumed samples: 5084160 | elapsed time per iteration (ms): 5741.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.211857E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:32.448527 | finish at 2025-09-10 11:59:53 + [2025-09-10 00:54:27] iteration 4966/ 11920 | consumed samples: 5085184 | elapsed time per iteration (ms): 5751.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.182868E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:35.565492 | finish at 2025-09-10 12:01:02 + [2025-09-10 00:54:33] iteration 4967/ 11920 | consumed samples: 5086208 | elapsed time per iteration (ms): 6078.9 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.195389E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:44:26.610622 | finish at 2025-09-10 12:38:59 + [2025-09-10 00:54:38] iteration 4968/ 11920 | consumed samples: 5087232 | elapsed time per iteration (ms): 5745.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.169663E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:44.468573 | finish at 2025-09-10 12:00:23 + [2025-09-10 00:54:44] iteration 4969/ 11920 | consumed samples: 5088256 | elapsed time per iteration (ms): 5746.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 5.163079E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:40.754606 | finish at 2025-09-10 12:00:25 + [2025-09-10 00:54:50] iteration 4970/ 11920 | consumed samples: 5089280 | elapsed time per iteration (ms): 5749.3 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.136586E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:57.553828 | finish at 2025-09-10 12:00:47 + [2025-09-10 00:54:56] iteration 4971/ 11920 | consumed samples: 5090304 | elapsed time per iteration (ms): 5757.9 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.142493E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:51.577513 | finish at 2025-09-10 12:01:47 + [2025-09-10 00:55:01] iteration 4972/ 11920 | consumed samples: 5091328 | elapsed time per iteration (ms): 5759.1 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.155094E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:54.569427 | finish at 2025-09-10 12:01:56 + [2025-09-10 00:55:07] iteration 4973/ 11920 | consumed samples: 5092352 | elapsed time per iteration (ms): 5743.8 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.177234E+00 | loss scale: 1.0 | grad norm: 0.635 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:02.438116 | finish at 2025-09-10 12:00:10 + [2025-09-10 00:55:13] iteration 
4974/ 11920 | consumed samples: 5093376 | elapsed time per iteration (ms): 5759.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.134587E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:46.878272 | finish at 2025-09-10 12:02:00 + [2025-09-10 00:55:19] iteration 4975/ 11920 | consumed samples: 5094400 | elapsed time per iteration (ms): 5971.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.114160E+00 | loss scale: 1.0 | grad norm: 0.378 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:12.755764 | finish at 2025-09-10 12:26:32 + [2025-09-10 00:55:25] iteration 4976/ 11920 | consumed samples: 5095424 | elapsed time per iteration (ms): 5753.5 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.100619E+00 | loss scale: 1.0 | grad norm: 0.347 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:52.176414 | finish at 2025-09-10 12:01:17 + [2025-09-10 00:55:30] iteration 4977/ 11920 | consumed samples: 5096448 | elapsed time per iteration (ms): 5743.4 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.072271E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:36.385489 | finish at 2025-09-10 12:00:07 + [2025-09-10 00:55:36] iteration 4978/ 11920 | consumed samples: 5097472 | elapsed time per iteration (ms): 5777.3 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.075355E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:26.161423 | finish at 2025-09-10 12:04:02 + [2025-09-10 00:55:42] iteration 4979/ 11920 | consumed samples: 5098496 | elapsed time per iteration (ms): 5759.0 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.069330E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:13.310454 | finish at 2025-09-10 12:01:55 + [2025-09-10 00:55:48] iteration 4980/ 11920 | consumed samples: 5099520 | elapsed time per iteration (ms): 5762.7 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.067871E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:33.254385 | finish at 2025-09-10 12:02:21 + [2025-09-10 00:55:53] iteration 4981/ 11920 | consumed samples: 5100544 | elapsed time per iteration (ms): 5744.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.073858E+00 | loss scale: 1.0 | grad norm: 0.375 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:20.931099 | finish at 2025-09-10 12:00:14 + [2025-09-10 00:55:59] iteration 4982/ 11920 | consumed samples: 5101568 | elapsed time per iteration (ms): 5765.5 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.116696E+00 | loss scale: 1.0 | grad norm: 0.527 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:40.917069 | finish at 2025-09-10 12:02:40 + [2025-09-10 00:56:05] iteration 4983/ 11920 | consumed samples: 5102592 | elapsed time per iteration (ms): 5759.2 | throughput per 
GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.084278E+00 | loss scale: 1.0 | grad norm: 0.450 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:51.303133 | finish at 2025-09-10 12:01:56 + [2025-09-10 00:56:11] iteration 4984/ 11920 | consumed samples: 5103616 | elapsed time per iteration (ms): 5716.9 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.119886E+00 | loss scale: 1.0 | grad norm: 0.557 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:52.071894 | finish at 2025-09-10 11:57:03 + [2025-09-10 00:56:16] iteration 4985/ 11920 | consumed samples: 5104640 | elapsed time per iteration (ms): 5735.2 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.053617E+00 | loss scale: 1.0 | grad norm: 0.353 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:53.275856 | finish at 2025-09-10 11:59:10 + [2025-09-10 00:56:22] iteration 4986/ 11920 | consumed samples: 5105664 | elapsed time per iteration (ms): 5766.3 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.034674E+00 | loss scale: 1.0 | grad norm: 0.394 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:23.723980 | finish at 2025-09-10 12:02:46 + [2025-09-10 00:56:28] iteration 4987/ 11920 | consumed samples: 5106688 | elapsed time per iteration (ms): 5983.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.035491E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:31:22.362504 
| finish at 2025-09-10 12:27:51 + [2025-09-10 00:56:34] iteration 4988/ 11920 | consumed samples: 5107712 | elapsed time per iteration (ms): 5770.5 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.996250E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:41.247750 | finish at 2025-09-10 12:03:15 + [2025-09-10 00:56:40] iteration 4989/ 11920 | consumed samples: 5108736 | elapsed time per iteration (ms): 5784.2 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.993707E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:10.504697 | finish at 2025-09-10 12:04:50 + [2025-09-10 00:56:45] iteration 4990/ 11920 | consumed samples: 5109760 | elapsed time per iteration (ms): 5768.9 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.967944E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:18.724265 | finish at 2025-09-10 12:03:04 + [2025-09-10 00:56:51] iteration 4991/ 11920 | consumed samples: 5110784 | elapsed time per iteration (ms): 5765.7 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.961913E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:50.227081 | finish at 2025-09-10 12:02:41 + [2025-09-10 00:56:57] iteration 4992/ 11920 | consumed samples: 5111808 | elapsed time per iteration (ms): 5767.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
4.933597E+00 | loss scale: 1.0 | grad norm: 0.335 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:59.642788 | finish at 2025-09-10 12:02:57 + [2025-09-10 00:57:03] iteration 4993/ 11920 | consumed samples: 5112832 | elapsed time per iteration (ms): 5771.2 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.947771E+00 | loss scale: 1.0 | grad norm: 0.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:17.288618 | finish at 2025-09-10 12:03:20 + [2025-09-10 00:57:09] iteration 4994/ 11920 | consumed samples: 5113856 | elapsed time per iteration (ms): 5770.6 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.941673E+00 | loss scale: 1.0 | grad norm: 0.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:07.007726 | finish at 2025-09-10 12:03:16 + [2025-09-10 00:57:15] iteration 4995/ 11920 | consumed samples: 5114880 | elapsed time per iteration (ms): 5978.2 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.908467E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:59.143684 | finish at 2025-09-10 12:27:14 + [2025-09-10 00:57:20] iteration 4996/ 11920 | consumed samples: 5115904 | elapsed time per iteration (ms): 5760.6 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.879803E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:46.073115 | finish at 2025-09-10 12:02:06 + [2025-09-10 00:57:26] iteration 4997/ 11920 | consumed samples: 
5116928 | elapsed time per iteration (ms): 5765.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.929814E+00 | loss scale: 1.0 | grad norm: 0.502 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:13.597993 | finish at 2025-09-10 12:02:40 + [2025-09-10 00:57:32] iteration 4998/ 11920 | consumed samples: 5117952 | elapsed time per iteration (ms): 5742.9 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.912495E+00 | loss scale: 1.0 | grad norm: 0.476 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:32.117074 | finish at 2025-09-10 12:00:04 + [2025-09-10 00:57:38] iteration 4999/ 11920 | consumed samples: 5118976 | elapsed time per iteration (ms): 5776.3 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.877689E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:17.638320 | finish at 2025-09-10 12:03:55 + [2025-09-10 00:57:43] iteration 5000/ 11920 | consumed samples: 5120000 | elapsed time per iteration (ms): 5776.4 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.870082E+00 | loss scale: 1.0 | grad norm: 0.350 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:12.789259 | finish at 2025-09-10 12:03:56 + [2025-09-10 00:57:49] iteration 5001/ 11920 | consumed samples: 5121024 | elapsed time per iteration (ms): 5783.8 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.848892E+00 | loss scale: 1.0 | grad norm: 0.333 | num zeros: 1.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:57.941505 | finish at 2025-09-10 12:04:47 + [2025-09-10 00:57:55] iteration 5002/ 11920 | consumed samples: 5122048 | elapsed time per iteration (ms): 5786.5 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.886424E+00 | loss scale: 1.0 | grad norm: 0.446 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:11.086012 | finish at 2025-09-10 12:05:06 + [2025-09-10 00:58:01] iteration 5003/ 11920 | consumed samples: 5123072 | elapsed time per iteration (ms): 5803.2 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.854560E+00 | loss scale: 1.0 | grad norm: 0.442 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:00.998307 | finish at 2025-09-10 12:07:02 + [2025-09-10 00:58:07] iteration 5004/ 11920 | consumed samples: 5124096 | elapsed time per iteration (ms): 5799.4 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.863266E+00 | loss scale: 1.0 | grad norm: 0.486 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:28.937940 | finish at 2025-09-10 12:06:35 + [2025-09-10 00:58:12] iteration 5005/ 11920 | consumed samples: 5125120 | elapsed time per iteration (ms): 5765.5 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.865686E+00 | loss scale: 1.0 | grad norm: 0.421 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:28.708302 | finish at 2025-09-10 12:02:41 + [2025-09-10 00:58:18] iteration 5006/ 11920 | consumed samples: 5126144 | elapsed time per iteration (ms): 5765.8 | throughput per GPU (TFLOP/s/GPU): 78.3 
| MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.831249E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:24.670312 | finish at 2025-09-10 12:02:43 + [2025-09-10 00:58:24] iteration 5007/ 11920 | consumed samples: 5127168 | elapsed time per iteration (ms): 5742.0 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.792734E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:34.148390 | finish at 2025-09-10 11:59:58 + [2025-09-10 00:58:30] iteration 5008/ 11920 | consumed samples: 5128192 | elapsed time per iteration (ms): 5748.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.767648E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:12.849976 | finish at 2025-09-10 12:00:42 + [2025-09-10 00:58:36] iteration 5009/ 11920 | consumed samples: 5129216 | elapsed time per iteration (ms): 6061.5 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.728790E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:11.277453 | finish at 2025-09-10 12:36:47 + [2025-09-10 00:58:41] iteration 5010/ 11920 | consumed samples: 5130240 | elapsed time per iteration (ms): 5753.5 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.717722E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:36.454248 | finish at 2025-09-10 
12:01:18 + [2025-09-10 00:58:47] iteration 5011/ 11920 | consumed samples: 5131264 | elapsed time per iteration (ms): 6066.3 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.698990E+00 | loss scale: 1.0 | grad norm: 0.379 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:32.118826 | finish at 2025-09-10 12:37:20 + [2025-09-10 00:58:53] iteration 5012/ 11920 | consumed samples: 5132288 | elapsed time per iteration (ms): 5753.2 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.728865E+00 | loss scale: 1.0 | grad norm: 0.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:23.260792 | finish at 2025-09-10 12:01:16 + [2025-09-10 00:58:59] iteration 5013/ 11920 | consumed samples: 5133312 | elapsed time per iteration (ms): 5756.8 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.698679E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:42.292911 | finish at 2025-09-10 12:01:41 + [2025-09-10 00:59:05] iteration 5014/ 11920 | consumed samples: 5134336 | elapsed time per iteration (ms): 5748.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.702386E+00 | loss scale: 1.0 | grad norm: 0.602 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:38.660967 | finish at 2025-09-10 12:00:43 + [2025-09-10 00:59:10] iteration 5015/ 11920 | consumed samples: 5135360 | elapsed time per iteration (ms): 5758.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.778952E+00 | loss 
scale: 1.0 | grad norm: 0.836 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:43.842523 | finish at 2025-09-10 12:01:54 + [2025-09-10 00:59:16] iteration 5016/ 11920 | consumed samples: 5136384 | elapsed time per iteration (ms): 5733.1 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.711407E+00 | loss scale: 1.0 | grad norm: 0.532 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:41.200161 | finish at 2025-09-10 11:58:57 + [2025-09-10 00:59:22] iteration 5017/ 11920 | consumed samples: 5137408 | elapsed time per iteration (ms): 5739.9 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.672832E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:22.874447 | finish at 2025-09-10 11:59:45 + [2025-09-10 00:59:28] iteration 5018/ 11920 | consumed samples: 5138432 | elapsed time per iteration (ms): 5754.5 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.698842E+00 | loss scale: 1.0 | grad norm: 0.811 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:57.278648 | finish at 2025-09-10 12:01:25 + [2025-09-10 00:59:33] iteration 5019/ 11920 | consumed samples: 5139456 | elapsed time per iteration (ms): 5762.7 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.669451E+00 | loss scale: 1.0 | grad norm: 0.417 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:48.620313 | finish at 2025-09-10 12:02:22 + [2025-09-10 00:59:39] iteration 5020/ 11920 | consumed samples: 5140480 | elapsed time 
per iteration (ms): 5751.8 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.612441E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:27.083173 | finish at 2025-09-10 12:01:06 + [2025-09-10 00:59:45] iteration 5021/ 11920 | consumed samples: 5141504 | elapsed time per iteration (ms): 5739.2 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.578609E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:54.731726 | finish at 2025-09-10 11:59:40 + [2025-09-10 00:59:51] iteration 5022/ 11920 | consumed samples: 5142528 | elapsed time per iteration (ms): 5733.2 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.562663E+00 | loss scale: 1.0 | grad norm: 0.365 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:07.344389 | finish at 2025-09-10 11:58:58 + [2025-09-10 00:59:56] iteration 5023/ 11920 | consumed samples: 5143552 | elapsed time per iteration (ms): 5768.3 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.587249E+00 | loss scale: 1.0 | grad norm: 0.577 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:03.715545 | finish at 2025-09-10 12:03:00 + [2025-09-10 01:00:02] iteration 5024/ 11920 | consumed samples: 5144576 | elapsed time per iteration (ms): 5929.5 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.562405E+00 | loss scale: 1.0 | grad norm: 0.503 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:21:29.697891 | finish at 2025-09-10 12:21:32 + [2025-09-10 01:00:08] iteration 5025/ 11920 | consumed samples: 5145600 | elapsed time per iteration (ms): 5739.9 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.519308E+00 | loss scale: 1.0 | grad norm: 0.397 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:36.900599 | finish at 2025-09-10 11:59:45 + [2025-09-10 01:00:14] iteration 5026/ 11920 | consumed samples: 5146624 | elapsed time per iteration (ms): 5765.7 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.523870E+00 | loss scale: 1.0 | grad norm: 0.528 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:28.815398 | finish at 2025-09-10 12:02:43 + [2025-09-10 01:00:20] iteration 5027/ 11920 | consumed samples: 5147648 | elapsed time per iteration (ms): 5760.7 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.556716E+00 | loss scale: 1.0 | grad norm: 0.861 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:48.674285 | finish at 2025-09-10 12:02:08 + [2025-09-10 01:00:25] iteration 5028/ 11920 | consumed samples: 5148672 | elapsed time per iteration (ms): 5740.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.532758E+00 | loss scale: 1.0 | grad norm: 0.469 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:19.739927 | finish at 2025-09-10 11:59:45 + [2025-09-10 01:00:31] iteration 5029/ 11920 | consumed samples: 5149696 | elapsed time per iteration (ms): 5767.7 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 4.560362E+00 | loss scale: 1.0 | grad norm: 0.686 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:02:25.126755 | finish at 2025-09-10 12:02:56 + [2025-09-10 01:00:37] iteration 5030/ 11920 | consumed samples: 5150720 | elapsed time per iteration (ms): 5716.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.544920E+00 | loss scale: 1.0 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:24.462724 | finish at 2025-09-10 11:57:01 + [2025-09-10 01:00:43] iteration 5031/ 11920 | consumed samples: 5151744 | elapsed time per iteration (ms): 5748.7 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.512163E+00 | loss scale: 1.0 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:02.841267 | finish at 2025-09-10 12:00:45 + [2025-09-10 01:00:49] iteration 5032/ 11920 | consumed samples: 5152768 | elapsed time per iteration (ms): 6085.6 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.467641E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:37.429819 | finish at 2025-09-10 12:39:26 + [2025-09-10 01:00:54] iteration 5033/ 11920 | consumed samples: 5153792 | elapsed time per iteration (ms): 5738.4 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.444949E+00 | loss scale: 1.0 | grad norm: 0.333 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:58:40.583990 | finish at 2025-09-10 11:59:35 + [2025-09-10 
01:01:00] iteration 5034/ 11920 | consumed samples: 5154816 | elapsed time per iteration (ms): 5738.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.407686E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:58:35.288830 | finish at 2025-09-10 11:59:35 + [2025-09-10 01:01:06] iteration 5035/ 11920 | consumed samples: 5155840 | elapsed time per iteration (ms): 5728.3 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.393176E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:19.047396 | finish at 2025-09-10 11:58:25 + [2025-09-10 01:01:12] iteration 5036/ 11920 | consumed samples: 5156864 | elapsed time per iteration (ms): 5716.0 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.353098E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:49.185819 | finish at 2025-09-10 11:57:01 + [2025-09-10 01:01:18] iteration 5037/ 11920 | consumed samples: 5157888 | elapsed time per iteration (ms): 5909.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.345948E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:56.625349 | finish at 2025-09-10 12:19:14 + [2025-09-10 01:01:23] iteration 5038/ 11920 | consumed samples: 5158912 | elapsed time per iteration (ms): 5724.7 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.344126E+00 | loss scale: 1.0 | grad norm: 
0.366 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:37.688769 | finish at 2025-09-10 11:58:01 + [2025-09-10 01:01:29] iteration 5039/ 11920 | consumed samples: 5159936 | elapsed time per iteration (ms): 5718.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.316699E+00 | loss scale: 1.0 | grad norm: 0.484 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:50.835230 | finish at 2025-09-10 11:57:20 + [2025-09-10 01:01:35] iteration 5040/ 11920 | consumed samples: 5160960 | elapsed time per iteration (ms): 5718.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.325324E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:40.318527 | finish at 2025-09-10 11:57:15 + [2025-09-10 01:01:40] iteration 5041/ 11920 | consumed samples: 5161984 | elapsed time per iteration (ms): 5758.1 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.345520E+00 | loss scale: 1.0 | grad norm: 0.603 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:09.906163 | finish at 2025-09-10 12:01:50 + [2025-09-10 01:01:46] iteration 5042/ 11920 | consumed samples: 5163008 | elapsed time per iteration (ms): 5719.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.298110E+00 | loss scale: 1.0 | grad norm: 0.408 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:41.528857 | finish at 2025-09-10 11:57:28 + [2025-09-10 01:01:52] iteration 5043/ 11920 | consumed samples: 5164032 | elapsed time per iteration (ms): 
5722.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.260552E+00 | loss scale: 1.0 | grad norm: 0.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:52.239426 | finish at 2025-09-10 11:57:44 + [2025-09-10 01:01:58] iteration 5044/ 11920 | consumed samples: 5165056 | elapsed time per iteration (ms): 5751.7 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.277396E+00 | loss scale: 1.0 | grad norm: 0.461 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:09.024751 | finish at 2025-09-10 12:01:07 + [2025-09-10 01:02:04] iteration 5045/ 11920 | consumed samples: 5166080 | elapsed time per iteration (ms): 5976.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.256122E+00 | loss scale: 1.0 | grad norm: 0.423 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:24:49.354008 | finish at 2025-09-10 12:26:53 + [2025-09-10 01:02:09] iteration 5046/ 11920 | consumed samples: 5167104 | elapsed time per iteration (ms): 5725.4 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.227739E+00 | loss scale: 1.0 | grad norm: 0.451 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:56.704234 | finish at 2025-09-10 11:58:06 + [2025-09-10 01:02:15] iteration 5047/ 11920 | consumed samples: 5168128 | elapsed time per iteration (ms): 5731.7 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.220477E+00 | loss scale: 1.0 | grad norm: 0.391 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:56:34.116275 | finish at 2025-09-10 11:58:49 + [2025-09-10 01:02:21] iteration 5048/ 11920 | consumed samples: 5169152 | elapsed time per iteration (ms): 5733.6 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.203416E+00 | loss scale: 1.0 | grad norm: 0.477 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:41.144510 | finish at 2025-09-10 11:59:02 + [2025-09-10 01:02:27] iteration 5049/ 11920 | consumed samples: 5170176 | elapsed time per iteration (ms): 5945.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.200764E+00 | loss scale: 1.0 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:48.251806 | finish at 2025-09-10 12:23:15 + [2025-09-10 01:02:32] iteration 5050/ 11920 | consumed samples: 5171200 | elapsed time per iteration (ms): 5737.2 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.203763E+00 | loss scale: 1.0 | grad norm: 0.571 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:54.403632 | finish at 2025-09-10 11:59:27 + [2025-09-10 01:02:38] iteration 5051/ 11920 | consumed samples: 5172224 | elapsed time per iteration (ms): 5967.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.215990E+00 | loss scale: 1.0 | grad norm: 0.741 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:07.747216 | finish at 2025-09-10 12:25:46 + [2025-09-10 01:02:44] iteration 5052/ 11920 | consumed samples: 5173248 | elapsed time per iteration (ms): 5722.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 4.202557E+00 | loss scale: 1.0 | grad norm: 0.637 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:00.494768 | finish at 2025-09-10 11:57:45 + [2025-09-10 01:02:50] iteration 5053/ 11920 | consumed samples: 5174272 | elapsed time per iteration (ms): 5720.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.155849E+00 | loss scale: 1.0 | grad norm: 0.503 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:44.719973 | finish at 2025-09-10 11:57:35 + [2025-09-10 01:02:56] iteration 5054/ 11920 | consumed samples: 5175296 | elapsed time per iteration (ms): 5730.7 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.167494E+00 | loss scale: 1.0 | grad norm: 0.585 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:46.712934 | finish at 2025-09-10 11:58:42 + [2025-09-10 01:03:01] iteration 5055/ 11920 | consumed samples: 5176320 | elapsed time per iteration (ms): 5770.5 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.173905E+00 | loss scale: 1.0 | grad norm: 0.647 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:14.762003 | finish at 2025-09-10 12:03:16 + [2025-09-10 01:03:07] iteration 5056/ 11920 | consumed samples: 5177344 | elapsed time per iteration (ms): 6105.1 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.127722E+00 | loss scale: 1.0 | grad norm: 0.445 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:25.510563 | finish at 2025-09-10 12:41:33 + [2025-09-10 01:03:13] iteration 
5057/ 11920 | consumed samples: 5178368 | elapsed time per iteration (ms): 5709.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.112114E+00 | loss scale: 1.0 | grad norm: 0.435 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:04.506816 | finish at 2025-09-10 11:56:18 + [2025-09-10 01:03:19] iteration 5058/ 11920 | consumed samples: 5179392 | elapsed time per iteration (ms): 5713.8 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.086646E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:27.964398 | finish at 2025-09-10 11:56:47 + [2025-09-10 01:03:25] iteration 5059/ 11920 | consumed samples: 5180416 | elapsed time per iteration (ms): 5739.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.092855E+00 | loss scale: 1.0 | grad norm: 0.546 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:18.940460 | finish at 2025-09-10 11:59:44 + [2025-09-10 01:03:30] iteration 5060/ 11920 | consumed samples: 5181440 | elapsed time per iteration (ms): 5728.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.116822E+00 | loss scale: 1.0 | grad norm: 0.804 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:56.660390 | finish at 2025-09-10 11:58:27 + [2025-09-10 01:03:36] iteration 5061/ 11920 | consumed samples: 5182464 | elapsed time per iteration (ms): 5751.1 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.066041E+00 | loss scale: 1.0 | grad norm: 0.400 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:26.556579 | finish at 2025-09-10 12:01:03 + [2025-09-10 01:03:42] iteration 5062/ 11920 | consumed samples: 5183488 | elapsed time per iteration (ms): 6082.6 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.057715E+00 | loss scale: 1.0 | grad norm: 0.415 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:14.169112 | finish at 2025-09-10 12:38:56 + [2025-09-10 01:03:48] iteration 5063/ 11920 | consumed samples: 5184512 | elapsed time per iteration (ms): 5753.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.025198E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:31.201725 | finish at 2025-09-10 12:01:19 + [2025-09-10 01:03:54] iteration 5064/ 11920 | consumed samples: 5185536 | elapsed time per iteration (ms): 5751.6 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.015827E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:12.757288 | finish at 2025-09-10 12:01:06 + [2025-09-10 01:03:59] iteration 5065/ 11920 | consumed samples: 5186560 | elapsed time per iteration (ms): 5770.6 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.009456E+00 | loss scale: 1.0 | grad norm: 0.514 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:17.295213 | finish at 2025-09-10 12:03:17 + [2025-09-10 01:04:05] iteration 5066/ 11920 | consumed samples: 5187584 | elapsed time per iteration (ms): 5722.6 | throughput per 
GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.003602E+00 | loss scale: 1.0 | grad norm: 0.530 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:42.554848 | finish at 2025-09-10 11:57:48 + [2025-09-10 01:04:11] iteration 5067/ 11920 | consumed samples: 5188608 | elapsed time per iteration (ms): 5718.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.980409E+00 | loss scale: 1.0 | grad norm: 0.419 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:11.240767 | finish at 2025-09-10 11:57:22 + [2025-09-10 01:04:17] iteration 5068/ 11920 | consumed samples: 5189632 | elapsed time per iteration (ms): 6088.8 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.962220E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:20.347827 | finish at 2025-09-10 12:39:37 + [2025-09-10 01:04:23] iteration 5069/ 11920 | consumed samples: 5190656 | elapsed time per iteration (ms): 6068.1 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.939515E+00 | loss scale: 1.0 | grad norm: 0.397 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:32:52.887781 | finish at 2025-09-10 12:37:16 + [2025-09-10 01:04:29] iteration 5070/ 11920 | consumed samples: 5191680 | elapsed time per iteration (ms): 6029.5 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.010819E+00 | loss scale: 1.0 | grad norm: 1.039 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:22.240002 
| finish at 2025-09-10 12:32:51 + [2025-09-10 01:04:35] iteration 5071/ 11920 | consumed samples: 5192704 | elapsed time per iteration (ms): 5964.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.961133E+00 | loss scale: 1.0 | grad norm: 0.642 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:50.664209 | finish at 2025-09-10 12:25:26 + [2025-09-10 01:04:41] iteration 5072/ 11920 | consumed samples: 5193728 | elapsed time per iteration (ms): 6093.0 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.951935E+00 | loss scale: 1.0 | grad norm: 0.612 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:25.191727 | finish at 2025-09-10 12:40:06 + [2025-09-10 01:04:47] iteration 5073/ 11920 | consumed samples: 5194752 | elapsed time per iteration (ms): 6078.9 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.932241E+00 | loss scale: 1.0 | grad norm: 0.521 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:33:41.971049 | finish at 2025-09-10 12:38:29 + [2025-09-10 01:04:53] iteration 5074/ 11920 | consumed samples: 5195776 | elapsed time per iteration (ms): 6047.8 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.915765E+00 | loss scale: 1.0 | grad norm: 0.352 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:30:03.216388 | finish at 2025-09-10 12:34:57 + [2025-09-10 01:04:59] iteration 5075/ 11920 | consumed samples: 5196800 | elapsed time per iteration (ms): 6035.9 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.884011E+00 | loss scale: 1.0 | grad norm: 0.386 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:35.889699 | finish at 2025-09-10 12:33:35 + [2025-09-10 01:05:05] iteration 5076/ 11920 | consumed samples: 5197824 | elapsed time per iteration (ms): 6019.3 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.891272E+00 | loss scale: 1.0 | grad norm: 0.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:36.395857 | finish at 2025-09-10 12:31:42 + [2025-09-10 01:05:11] iteration 5077/ 11920 | consumed samples: 5198848 | elapsed time per iteration (ms): 5767.9 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.892499E+00 | loss scale: 1.0 | grad norm: 0.582 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:49.473698 | finish at 2025-09-10 12:03:01 + [2025-09-10 01:05:17] iteration 5078/ 11920 | consumed samples: 5199872 | elapsed time per iteration (ms): 5743.5 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.852273E+00 | loss scale: 1.0 | grad norm: 0.446 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:56.772484 | finish at 2025-09-10 12:00:14 + [2025-09-10 01:05:23] iteration 5079/ 11920 | consumed samples: 5200896 | elapsed time per iteration (ms): 6012.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.863675E+00 | loss scale: 1.0 | grad norm: 0.746 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:25:28.307869 | finish at 2025-09-10 12:30:51 + [2025-09-10 01:05:29] iteration 5080/ 11920 | consumed samples: 
5201920 | elapsed time per iteration (ms): 5736.6 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.872137E+00 | loss scale: 1.0 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:58.628855 | finish at 2025-09-10 11:59:27 + [2025-09-10 01:05:34] iteration 5081/ 11920 | consumed samples: 5202944 | elapsed time per iteration (ms): 5740.6 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.889924E+00 | loss scale: 1.0 | grad norm: 0.832 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:19.716303 | finish at 2025-09-10 11:59:54 + [2025-09-10 01:05:40] iteration 5082/ 11920 | consumed samples: 5203968 | elapsed time per iteration (ms): 5731.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.836839E+00 | loss scale: 1.0 | grad norm: 0.332 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:12.234412 | finish at 2025-09-10 11:58:52 + [2025-09-10 01:05:46] iteration 5083/ 11920 | consumed samples: 5204992 | elapsed time per iteration (ms): 6104.0 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.792670E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:35:33.086118 | finish at 2025-09-10 12:41:19 + [2025-09-10 01:05:52] iteration 5084/ 11920 | consumed samples: 5206016 | elapsed time per iteration (ms): 5729.8 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.796555E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 1.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:49.183255 | finish at 2025-09-10 11:58:41 + [2025-09-10 01:05:58] iteration 5085/ 11920 | consumed samples: 5207040 | elapsed time per iteration (ms): 5738.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.773231E+00 | loss scale: 1.0 | grad norm: 0.448 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:39.465717 | finish at 2025-09-10 11:59:37 + [2025-09-10 01:06:03] iteration 5086/ 11920 | consumed samples: 5208064 | elapsed time per iteration (ms): 5738.8 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.783799E+00 | loss scale: 1.0 | grad norm: 0.490 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:39.282146 | finish at 2025-09-10 11:59:43 + [2025-09-10 01:06:09] iteration 5087/ 11920 | consumed samples: 5209088 | elapsed time per iteration (ms): 5741.2 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.790736E+00 | loss scale: 1.0 | grad norm: 0.558 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:49.435307 | finish at 2025-09-10 11:59:59 + [2025-09-10 01:06:15] iteration 5088/ 11920 | consumed samples: 5210112 | elapsed time per iteration (ms): 5730.8 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.770733E+00 | loss scale: 1.0 | grad norm: 0.535 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:32.730534 | finish at 2025-09-10 11:58:48 + [2025-09-10 01:06:21] iteration 5089/ 11920 | consumed samples: 5211136 | elapsed time per iteration (ms): 6052.4 | throughput per GPU (TFLOP/s/GPU): 74.6 
| MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.753321E+00 | loss scale: 1.0 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:03.945166 | finish at 2025-09-10 12:35:25 + [2025-09-10 01:06:27] iteration 5090/ 11920 | consumed samples: 5212160 | elapsed time per iteration (ms): 5747.6 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.782215E+00 | loss scale: 1.0 | grad norm: 0.671 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:16.079226 | finish at 2025-09-10 12:00:43 + [2025-09-10 01:06:33] iteration 5091/ 11920 | consumed samples: 5213184 | elapsed time per iteration (ms): 5955.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.758415E+00 | loss scale: 1.0 | grad norm: 0.523 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:47.239913 | finish at 2025-09-10 12:24:20 + [2025-09-10 01:06:38] iteration 5092/ 11920 | consumed samples: 5214208 | elapsed time per iteration (ms): 5750.5 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.713290E+00 | loss scale: 1.0 | grad norm: 0.411 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:24.327473 | finish at 2025-09-10 12:01:03 + [2025-09-10 01:06:44] iteration 5093/ 11920 | consumed samples: 5215232 | elapsed time per iteration (ms): 6034.0 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.709271E+00 | loss scale: 1.0 | grad norm: 0.668 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:26:34.355095 | finish at 2025-09-10 
12:33:19 + [2025-09-10 01:06:50] iteration 5094/ 11920 | consumed samples: 5216256 | elapsed time per iteration (ms): 5755.0 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.726514E+00 | loss scale: 1.0 | grad norm: 0.676 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:43.808173 | finish at 2025-09-10 12:01:34 + [2025-09-10 01:06:56] iteration 5095/ 11920 | consumed samples: 5217280 | elapsed time per iteration (ms): 6064.0 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.706372E+00 | loss scale: 1.0 | grad norm: 0.497 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:29:46.869228 | finish at 2025-09-10 12:36:43 + [2025-09-10 01:07:02] iteration 5096/ 11920 | consumed samples: 5218304 | elapsed time per iteration (ms): 5737.4 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.717096E+00 | loss scale: 1.0 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:32.017975 | finish at 2025-09-10 11:59:34 + [2025-09-10 01:07:08] iteration 5097/ 11920 | consumed samples: 5219328 | elapsed time per iteration (ms): 5724.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.699356E+00 | loss scale: 1.0 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:58.243576 | finish at 2025-09-10 11:58:06 + [2025-09-10 01:07:13] iteration 5098/ 11920 | consumed samples: 5220352 | elapsed time per iteration (ms): 5735.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.697300E+00 | loss 
scale: 1.0 | grad norm: 0.929 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:03.856998 | finish at 2025-09-10 11:59:17 + [2025-09-10 01:07:19] iteration 5099/ 11920 | consumed samples: 5221376 | elapsed time per iteration (ms): 5750.4 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.670956E+00 | loss scale: 1.0 | grad norm: 0.656 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:53:43.256056 | finish at 2025-09-10 12:01:02 + [2025-09-10 01:07:25] iteration 5100/ 11920 | consumed samples: 5222400 | elapsed time per iteration (ms): 5735.7 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.695152E+00 | loss scale: 1.0 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:57.424483 | finish at 2025-09-10 11:59:22 + [2025-09-10 01:07:31] iteration 5101/ 11920 | consumed samples: 5223424 | elapsed time per iteration (ms): 5744.2 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.665432E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:49.364830 | finish at 2025-09-10 12:00:20 + [2025-09-10 01:07:36] iteration 5102/ 11920 | consumed samples: 5224448 | elapsed time per iteration (ms): 5736.0 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.640875E+00 | loss scale: 1.0 | grad norm: 0.504 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:48.124816 | finish at 2025-09-10 11:59:25 + [2025-09-10 01:07:42] iteration 5103/ 11920 | consumed samples: 5225472 | elapsed time 
per iteration (ms): 5731.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.647902E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:11.865681 | finish at 2025-09-10 11:58:54 + [2025-09-10 01:07:48] iteration 5104/ 11920 | consumed samples: 5226496 | elapsed time per iteration (ms): 5965.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.623957E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:17:38.887344 | finish at 2025-09-10 12:25:27 + [2025-09-10 01:07:54] iteration 5105/ 11920 | consumed samples: 5227520 | elapsed time per iteration (ms): 5734.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.636268E+00 | loss scale: 1.0 | grad norm: 0.743 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:20.439926 | finish at 2025-09-10 11:59:14 + [2025-09-10 01:08:00] iteration 5106/ 11920 | consumed samples: 5228544 | elapsed time per iteration (ms): 5720.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.621226E+00 | loss scale: 1.0 | grad norm: 0.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:49:41.961190 | finish at 2025-09-10 11:57:42 + [2025-09-10 01:08:06] iteration 5107/ 11920 | consumed samples: 5229568 | elapsed time per iteration (ms): 5990.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.623144E+00 | loss scale: 1.0 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 11:20:15.110204 | finish at 2025-09-10 12:28:21 + [2025-09-10 01:08:11] iteration 5108/ 11920 | consumed samples: 5230592 | elapsed time per iteration (ms): 5732.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.602149E+00 | loss scale: 1.0 | grad norm: 0.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:49.896087 | finish at 2025-09-10 11:59:01 + [2025-09-10 01:08:17] iteration 5109/ 11920 | consumed samples: 5231616 | elapsed time per iteration (ms): 5992.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.597976E+00 | loss scale: 1.0 | grad norm: 0.707 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:20:16.684724 | finish at 2025-09-10 12:28:34 + [2025-09-10 01:08:23] iteration 5110/ 11920 | consumed samples: 5232640 | elapsed time per iteration (ms): 5736.1 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.593043E+00 | loss scale: 1.0 | grad norm: 0.736 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:02.668612 | finish at 2025-09-10 11:59:26 + [2025-09-10 01:08:29] iteration 5111/ 11920 | consumed samples: 5233664 | elapsed time per iteration (ms): 5710.7 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.576783E+00 | loss scale: 1.0 | grad norm: 0.592 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:04.362710 | finish at 2025-09-10 11:56:33 + [2025-09-10 01:08:34] iteration 5112/ 11920 | consumed samples: 5234688 | elapsed time per iteration (ms): 5722.6 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.593205E+00 | loss scale: 1.0 | grad norm: 0.900 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:49:19.171764 | finish at 2025-09-10 11:57:54 + [2025-09-10 01:08:40] iteration 5113/ 11920 | consumed samples: 5235712 | elapsed time per iteration (ms): 5713.4 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.560828E+00 | loss scale: 1.0 | grad norm: 0.445 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:11.289929 | finish at 2025-09-10 11:56:51 + [2025-09-10 01:08:46] iteration 5114/ 11920 | consumed samples: 5236736 | elapsed time per iteration (ms): 5700.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.528843E+00 | loss scale: 1.0 | grad norm: 0.428 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:36.660303 | finish at 2025-09-10 11:55:23 + [2025-09-10 01:08:52] iteration 5115/ 11920 | consumed samples: 5237760 | elapsed time per iteration (ms): 5701.8 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.536749E+00 | loss scale: 1.0 | grad norm: 0.379 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:40.613450 | finish at 2025-09-10 11:55:32 + [2025-09-10 01:08:58] iteration 5116/ 11920 | consumed samples: 5238784 | elapsed time per iteration (ms): 6007.2 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.515548E+00 | loss scale: 1.0 | grad norm: 0.379 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:21:12.854176 | finish at 2025-09-10 12:30:10 + [2025-09-10 
01:09:03] iteration 5117/ 11920 | consumed samples: 5239808 | elapsed time per iteration (ms): 5705.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.512373E+00 | loss scale: 1.0 | grad norm: 0.431 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:54.614674 | finish at 2025-09-10 11:55:58 + [2025-09-10 01:09:09] iteration 5118/ 11920 | consumed samples: 5240832 | elapsed time per iteration (ms): 6071.3 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.487818E+00 | loss scale: 1.0 | grad norm: 0.437 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:28:17.044428 | finish at 2025-09-10 12:37:26 + [2025-09-10 01:09:15] iteration 5119/ 11920 | consumed samples: 5241856 | elapsed time per iteration (ms): 6049.5 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.489589E+00 | loss scale: 1.0 | grad norm: 0.832 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:25:42.735361 | finish at 2025-09-10 12:34:58 + [2025-09-10 01:09:21] iteration 5120/ 11920 | consumed samples: 5242880 | elapsed time per iteration (ms): 5728.8 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.492113E+00 | loss scale: 1.0 | grad norm: 0.678 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:49:15.973339 | finish at 2025-09-10 11:58:37 + [2025-09-10 01:09:27] iteration 5121/ 11920 | consumed samples: 5243904 | elapsed time per iteration (ms): 5722.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.479174E+00 | loss scale: 1.0 | grad norm: 
0.716 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:30.226697 | finish at 2025-09-10 11:57:57 + [2025-09-10 01:09:33] iteration 5122/ 11920 | consumed samples: 5244928 | elapsed time per iteration (ms): 5693.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.481347E+00 | loss scale: 1.0 | grad norm: 0.554 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:45:03.708251 | finish at 2025-09-10 11:54:36 + [2025-09-10 01:09:38] iteration 5123/ 11920 | consumed samples: 5245952 | elapsed time per iteration (ms): 5711.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.478175E+00 | loss scale: 1.0 | grad norm: 0.562 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:57.610048 | finish at 2025-09-10 11:56:36 + [2025-09-10 01:09:44] iteration 5124/ 11920 | consumed samples: 5246976 | elapsed time per iteration (ms): 5689.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.455372E+00 | loss scale: 1.0 | grad norm: 0.431 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:25.189657 | finish at 2025-09-10 11:54:09 + [2025-09-10 01:09:50] iteration 5125/ 11920 | consumed samples: 5248000 | elapsed time per iteration (ms): 5684.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.451990E+00 | loss scale: 1.0 | grad norm: 0.354 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:48.589618 | finish at 2025-09-10 11:53:38 + [2025-09-10 01:09:55] iteration 5126/ 11920 | consumed samples: 5249024 | elapsed time per iteration (ms): 
5691.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.421726E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:27.765563 | finish at 2025-09-10 11:54:23 + [2025-09-10 01:10:01] iteration 5127/ 11920 | consumed samples: 5250048 | elapsed time per iteration (ms): 5913.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.430838E+00 | loss scale: 1.0 | grad norm: 0.443 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:30.459425 | finish at 2025-09-10 12:19:32 + [2025-09-10 01:10:07] iteration 5128/ 11920 | consumed samples: 5251072 | elapsed time per iteration (ms): 5700.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.415832E+00 | loss scale: 1.0 | grad norm: 0.660 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:45:15.056156 | finish at 2025-09-10 11:55:22 + [2025-09-10 01:10:13] iteration 5129/ 11920 | consumed samples: 5252096 | elapsed time per iteration (ms): 5700.9 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.425804E+00 | loss scale: 1.0 | grad norm: 0.539 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:45:14.979196 | finish at 2025-09-10 11:55:28 + [2025-09-10 01:10:18] iteration 5130/ 11920 | consumed samples: 5253120 | elapsed time per iteration (ms): 5694.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.403416E+00 | loss scale: 1.0 | grad norm: 0.531 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:44:23.668449 | finish at 2025-09-10 11:54:42 + [2025-09-10 01:10:24] iteration 5131/ 11920 | consumed samples: 5254144 | elapsed time per iteration (ms): 5679.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.402330E+00 | loss scale: 1.0 | grad norm: 0.469 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:39.179923 | finish at 2025-09-10 11:53:03 + [2025-09-10 01:10:30] iteration 5132/ 11920 | consumed samples: 5255168 | elapsed time per iteration (ms): 5705.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.417780E+00 | loss scale: 1.0 | grad norm: 0.894 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:45:25.824318 | finish at 2025-09-10 11:55:56 + [2025-09-10 01:10:35] iteration 5133/ 11920 | consumed samples: 5256192 | elapsed time per iteration (ms): 5710.6 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.433084E+00 | loss scale: 1.0 | grad norm: 1.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:45:57.637630 | finish at 2025-09-10 11:56:33 + [2025-09-10 01:10:41] iteration 5134/ 11920 | consumed samples: 5257216 | elapsed time per iteration (ms): 5686.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.419848E+00 | loss scale: 1.0 | grad norm: 0.776 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:11.021208 | finish at 2025-09-10 11:53:52 + [2025-09-10 01:10:47] iteration 5135/ 11920 | consumed samples: 5258240 | elapsed time per iteration (ms): 5699.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.393989E+00 | loss scale: 1.0 | grad norm: 0.621 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:27.761110 | finish at 2025-09-10 11:55:15 + [2025-09-10 01:10:53] iteration 5136/ 11920 | consumed samples: 5259264 | elapsed time per iteration (ms): 5881.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.393338E+00 | loss scale: 1.0 | grad norm: 0.583 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:57.056885 | finish at 2025-09-10 12:15:50 + [2025-09-10 01:10:58] iteration 5137/ 11920 | consumed samples: 5260288 | elapsed time per iteration (ms): 5695.6 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.391678E+00 | loss scale: 1.0 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:53.115944 | finish at 2025-09-10 11:54:51 + [2025-09-10 01:11:04] iteration 5138/ 11920 | consumed samples: 5261312 | elapsed time per iteration (ms): 5687.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.381315E+00 | loss scale: 1.0 | grad norm: 0.757 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:53.226507 | finish at 2025-09-10 11:53:57 + [2025-09-10 01:11:10] iteration 5139/ 11920 | consumed samples: 5262336 | elapsed time per iteration (ms): 5924.1 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.370242E+00 | loss scale: 1.0 | grad norm: 0.630 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:31.331273 | finish at 2025-09-10 12:20:41 + [2025-09-10 01:11:16] iteration 
5140/ 11920 | consumed samples: 5263360 | elapsed time per iteration (ms): 5673.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.351912E+00 | loss scale: 1.0 | grad norm: 0.614 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:41:09.356461 | finish at 2025-09-10 11:52:25 + [2025-09-10 01:11:21] iteration 5141/ 11920 | consumed samples: 5264384 | elapsed time per iteration (ms): 5686.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.349187E+00 | loss scale: 1.0 | grad norm: 0.489 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:29.116937 | finish at 2025-09-10 11:53:50 + [2025-09-10 01:11:27] iteration 5142/ 11920 | consumed samples: 5265408 | elapsed time per iteration (ms): 5671.7 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.325223E+00 | loss scale: 1.0 | grad norm: 0.353 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:42.511117 | finish at 2025-09-10 11:52:10 + [2025-09-10 01:11:33] iteration 5143/ 11920 | consumed samples: 5266432 | elapsed time per iteration (ms): 5667.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.337999E+00 | loss scale: 1.0 | grad norm: 0.435 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:08.893225 | finish at 2025-09-10 11:51:42 + [2025-09-10 01:11:38] iteration 5144/ 11920 | consumed samples: 5267456 | elapsed time per iteration (ms): 5686.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.323224E+00 | loss scale: 1.0 | grad norm: 0.650 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:10.105736 | finish at 2025-09-10 11:53:48 + [2025-09-10 01:11:44] iteration 5145/ 11920 | consumed samples: 5268480 | elapsed time per iteration (ms): 5880.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.333508E+00 | loss scale: 1.0 | grad norm: 0.514 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:00.736932 | finish at 2025-09-10 12:15:45 + [2025-09-10 01:11:50] iteration 5146/ 11920 | consumed samples: 5269504 | elapsed time per iteration (ms): 5662.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.324270E+00 | loss scale: 1.0 | grad norm: 0.389 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:18.095748 | finish at 2025-09-10 11:51:08 + [2025-09-10 01:11:56] iteration 5147/ 11920 | consumed samples: 5270528 | elapsed time per iteration (ms): 5675.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.328883E+00 | loss scale: 1.0 | grad norm: 0.784 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:42.998151 | finish at 2025-09-10 11:52:39 + [2025-09-10 01:12:01] iteration 5148/ 11920 | consumed samples: 5271552 | elapsed time per iteration (ms): 5700.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.336974E+00 | loss scale: 1.0 | grad norm: 1.005 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:24.033107 | finish at 2025-09-10 11:55:25 + [2025-09-10 01:12:07] iteration 5149/ 11920 | consumed samples: 5272576 | elapsed time per iteration (ms): 5671.7 | throughput per 
GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.325955E+00 | loss scale: 1.0 | grad norm: 0.660 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:03.082319 | finish at 2025-09-10 11:52:10 + [2025-09-10 01:12:13] iteration 5150/ 11920 | consumed samples: 5273600 | elapsed time per iteration (ms): 5698.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.352540E+00 | loss scale: 1.0 | grad norm: 1.764 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:56.315160 | finish at 2025-09-10 11:55:09 + [2025-09-10 01:12:18] iteration 5151/ 11920 | consumed samples: 5274624 | elapsed time per iteration (ms): 5674.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.314610E+00 | loss scale: 1.0 | grad norm: 0.371 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:08.043700 | finish at 2025-09-10 11:52:26 + [2025-09-10 01:12:24] iteration 5152/ 11920 | consumed samples: 5275648 | elapsed time per iteration (ms): 5669.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.307594E+00 | loss scale: 1.0 | grad norm: 0.608 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:31.568871 | finish at 2025-09-10 11:51:56 + [2025-09-10 01:12:30] iteration 5153/ 11920 | consumed samples: 5276672 | elapsed time per iteration (ms): 5695.9 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.313766E+00 | loss scale: 1.0 | grad norm: 0.755 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:24.140532 
| finish at 2025-09-10 11:54:54 + [2025-09-10 01:12:35] iteration 5154/ 11920 | consumed samples: 5277696 | elapsed time per iteration (ms): 5684.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.286550E+00 | loss scale: 1.0 | grad norm: 0.351 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:41:01.354281 | finish at 2025-09-10 11:53:37 + [2025-09-10 01:12:41] iteration 5155/ 11920 | consumed samples: 5278720 | elapsed time per iteration (ms): 5690.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.280116E+00 | loss scale: 1.0 | grad norm: 0.397 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:41:37.403609 | finish at 2025-09-10 11:54:18 + [2025-09-10 01:12:47] iteration 5156/ 11920 | consumed samples: 5279744 | elapsed time per iteration (ms): 5673.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.283458E+00 | loss scale: 1.0 | grad norm: 0.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:33.135418 | finish at 2025-09-10 11:52:20 + [2025-09-10 01:12:52] iteration 5157/ 11920 | consumed samples: 5280768 | elapsed time per iteration (ms): 5695.0 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.276293E+00 | loss scale: 1.0 | grad norm: 0.963 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:41:55.142655 | finish at 2025-09-10 11:54:48 + [2025-09-10 01:12:58] iteration 5158/ 11920 | consumed samples: 5281792 | elapsed time per iteration (ms): 5698.3 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.322880E+00 | loss scale: 1.0 | grad norm: 1.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:42:11.807090 | finish at 2025-09-10 11:55:10 + [2025-09-10 01:13:04] iteration 5159/ 11920 | consumed samples: 5282816 | elapsed time per iteration (ms): 5678.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.285210E+00 | loss scale: 1.0 | grad norm: 0.401 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:55.208954 | finish at 2025-09-10 11:52:59 + [2025-09-10 01:13:10] iteration 5160/ 11920 | consumed samples: 5283840 | elapsed time per iteration (ms): 5894.7 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.266613E+00 | loss scale: 1.0 | grad norm: 0.484 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:04:07.903185 | finish at 2025-09-10 12:17:18 + [2025-09-10 01:13:15] iteration 5161/ 11920 | consumed samples: 5284864 | elapsed time per iteration (ms): 5668.9 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.259362E+00 | loss scale: 1.0 | grad norm: 0.415 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:38:35.856690 | finish at 2025-09-10 11:51:51 + [2025-09-10 01:13:21] iteration 5162/ 11920 | consumed samples: 5285888 | elapsed time per iteration (ms): 5684.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.267327E+00 | loss scale: 1.0 | grad norm: 0.843 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:13.218104 | finish at 2025-09-10 11:53:34 + [2025-09-10 01:13:27] iteration 5163/ 11920 | consumed samples: 
5286912 | elapsed time per iteration (ms): 5685.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.295721E+00 | loss scale: 1.0 | grad norm: 1.509 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:17.995791 | finish at 2025-09-10 11:53:45 + [2025-09-10 01:13:32] iteration 5164/ 11920 | consumed samples: 5287936 | elapsed time per iteration (ms): 5679.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.278401E+00 | loss scale: 1.0 | grad norm: 0.816 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:32.084724 | finish at 2025-09-10 11:53:05 + [2025-09-10 01:13:38] iteration 5165/ 11920 | consumed samples: 5288960 | elapsed time per iteration (ms): 5684.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.251199E+00 | loss scale: 1.0 | grad norm: 0.475 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:57.710259 | finish at 2025-09-10 11:53:36 + [2025-09-10 01:13:44] iteration 5166/ 11920 | consumed samples: 5289984 | elapsed time per iteration (ms): 5686.4 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.253616E+00 | loss scale: 1.0 | grad norm: 0.357 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:05.742277 | finish at 2025-09-10 11:53:50 + [2025-09-10 01:13:49] iteration 5167/ 11920 | consumed samples: 5291008 | elapsed time per iteration (ms): 5664.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.243677E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:30.291535 | finish at 2025-09-10 11:51:20 + [2025-09-10 01:13:55] iteration 5168/ 11920 | consumed samples: 5292032 | elapsed time per iteration (ms): 5660.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.235625E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:36:56.489609 | finish at 2025-09-10 11:50:52 + [2025-09-10 01:14:01] iteration 5169/ 11920 | consumed samples: 5293056 | elapsed time per iteration (ms): 5675.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222203E+00 | loss scale: 1.0 | grad norm: 0.728 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:38:32.463882 | finish at 2025-09-10 11:52:33 + [2025-09-10 01:14:06] iteration 5170/ 11920 | consumed samples: 5294080 | elapsed time per iteration (ms): 5674.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.239069E+00 | loss scale: 1.0 | grad norm: 0.506 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:38:22.261770 | finish at 2025-09-10 11:52:29 + [2025-09-10 01:14:12] iteration 5171/ 11920 | consumed samples: 5295104 | elapsed time per iteration (ms): 5670.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222410E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:48.516838 | finish at 2025-09-10 11:52:01 + [2025-09-10 01:14:18] iteration 5172/ 11920 | consumed samples: 5296128 | elapsed time per iteration (ms): 5664.2 | throughput per GPU (TFLOP/s/GPU): 79.7 
| MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220987E+00 | loss scale: 1.0 | grad norm: 0.757 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:01.816123 | finish at 2025-09-10 11:51:20 + [2025-09-10 01:14:24] iteration 5173/ 11920 | consumed samples: 5297152 | elapsed time per iteration (ms): 5677.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.262492E+00 | loss scale: 1.0 | grad norm: 2.301 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:38:22.606706 | finish at 2025-09-10 11:52:46 + [2025-09-10 01:14:29] iteration 5174/ 11920 | consumed samples: 5298176 | elapsed time per iteration (ms): 5665.7 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.224775E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:00.964717 | finish at 2025-09-10 11:51:30 + [2025-09-10 01:14:35] iteration 5175/ 11920 | consumed samples: 5299200 | elapsed time per iteration (ms): 5671.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.215393E+00 | loss scale: 1.0 | grad norm: 0.416 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:32.734730 | finish at 2025-09-10 11:52:08 + [2025-09-10 01:14:41] iteration 5176/ 11920 | consumed samples: 5300224 | elapsed time per iteration (ms): 5675.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.210219E+00 | loss scale: 1.0 | grad norm: 0.675 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:52.111244 | finish at 2025-09-10 
11:52:33 + [2025-09-10 01:14:46] iteration 5177/ 11920 | consumed samples: 5301248 | elapsed time per iteration (ms): 5655.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.210475E+00 | loss scale: 1.0 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:35:35.010340 | finish at 2025-09-10 11:50:21 + [2025-09-10 01:14:52] iteration 5178/ 11920 | consumed samples: 5302272 | elapsed time per iteration (ms): 5648.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203462E+00 | loss scale: 1.0 | grad norm: 0.543 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:39.452550 | finish at 2025-09-10 11:49:31 + [2025-09-10 01:14:57] iteration 5179/ 11920 | consumed samples: 5303296 | elapsed time per iteration (ms): 5669.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.224417E+00 | loss scale: 1.0 | grad norm: 0.783 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:36:55.427519 | finish at 2025-09-10 11:51:53 + [2025-09-10 01:15:03] iteration 5180/ 11920 | consumed samples: 5304320 | elapsed time per iteration (ms): 5669.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.183826E+00 | loss scale: 1.0 | grad norm: 0.522 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:36:50.569921 | finish at 2025-09-10 11:51:54 + [2025-09-10 01:15:09] iteration 5181/ 11920 | consumed samples: 5305344 | elapsed time per iteration (ms): 5666.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.196422E+00 | loss 
scale: 1.0 | grad norm: 0.928 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:36:25.305349 | finish at 2025-09-10 11:51:34 + [2025-09-10 01:15:14] iteration 5182/ 11920 | consumed samples: 5306368 | elapsed time per iteration (ms): 5652.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.203949E+00 | loss scale: 1.0 | grad norm: 0.367 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:47.776580 | finish at 2025-09-10 11:50:02 + [2025-09-10 01:15:20] iteration 5183/ 11920 | consumed samples: 5307392 | elapsed time per iteration (ms): 5666.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181773E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:36:14.245775 | finish at 2025-09-10 11:51:34 + [2025-09-10 01:15:26] iteration 5184/ 11920 | consumed samples: 5308416 | elapsed time per iteration (ms): 5658.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.174470E+00 | loss scale: 1.0 | grad norm: 0.397 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:35:12.148228 | finish at 2025-09-10 11:50:38 + [2025-09-10 01:15:31] iteration 5185/ 11920 | consumed samples: 5309440 | elapsed time per iteration (ms): 5649.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.158792E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:11.519033 | finish at 2025-09-10 11:49:43 + [2025-09-10 01:15:37] iteration 5186/ 11920 | consumed samples: 5310464 | elapsed time 
per iteration (ms): 5655.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155815E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:46.893225 | finish at 2025-09-10 11:50:24 + [2025-09-10 01:15:43] iteration 5187/ 11920 | consumed samples: 5311488 | elapsed time per iteration (ms): 5869.1 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.162130E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:58:36.321851 | finish at 2025-09-10 12:14:19 + [2025-09-10 01:15:49] iteration 5188/ 11920 | consumed samples: 5312512 | elapsed time per iteration (ms): 5984.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164630E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:29.531427 | finish at 2025-09-10 12:27:18 + [2025-09-10 01:15:55] iteration 5189/ 11920 | consumed samples: 5313536 | elapsed time per iteration (ms): 6019.4 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.146960E+00 | loss scale: 1.0 | grad norm: 0.342 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:16.623931 | finish at 2025-09-10 12:31:12 + [2025-09-10 01:16:01] iteration 5190/ 11920 | consumed samples: 5314560 | elapsed time per iteration (ms): 5848.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.149901E+00 | loss scale: 1.0 | grad norm: 0.396 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 10:55:57.816939 | finish at 2025-09-10 12:11:59 + [2025-09-10 01:16:06] iteration 5191/ 11920 | consumed samples: 5315584 | elapsed time per iteration (ms): 5655.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.150294E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:13.856874 | finish at 2025-09-10 11:50:20 + [2025-09-10 01:16:12] iteration 5192/ 11920 | consumed samples: 5316608 | elapsed time per iteration (ms): 5954.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.150990E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:43.268072 | finish at 2025-09-10 12:23:56 + [2025-09-10 01:16:18] iteration 5193/ 11920 | consumed samples: 5317632 | elapsed time per iteration (ms): 5655.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124321E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:01.128673 | finish at 2025-09-10 11:50:19 + [2025-09-10 01:16:24] iteration 5194/ 11920 | consumed samples: 5318656 | elapsed time per iteration (ms): 5895.8 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.132250E+00 | loss scale: 1.0 | grad norm: 0.610 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:55.384087 | finish at 2025-09-10 12:17:19 + [2025-09-10 01:16:30] iteration 5195/ 11920 | consumed samples: 5319680 | elapsed time per iteration (ms): 5652.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.140060E+00 | loss scale: 1.0 | grad norm: 0.348 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:33:32.016529 | finish at 2025-09-10 11:50:02 + [2025-09-10 01:16:36] iteration 5196/ 11920 | consumed samples: 5320704 | elapsed time per iteration (ms): 5959.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.132635E+00 | loss scale: 1.0 | grad norm: 0.434 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:07:52.734961 | finish at 2025-09-10 12:24:28 + [2025-09-10 01:16:41] iteration 5197/ 11920 | consumed samples: 5321728 | elapsed time per iteration (ms): 5662.2 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144530E+00 | loss scale: 1.0 | grad norm: 0.502 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:26.680303 | finish at 2025-09-10 11:51:08 + [2025-09-10 01:16:47] iteration 5198/ 11920 | consumed samples: 5322752 | elapsed time per iteration (ms): 5646.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128718E+00 | loss scale: 1.0 | grad norm: 0.746 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:35.698419 | finish at 2025-09-10 11:49:23 + [2025-09-10 01:16:53] iteration 5199/ 11920 | consumed samples: 5323776 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.121502E+00 | loss scale: 1.0 | grad norm: 0.341 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:04.302783 | finish at 2025-09-10 11:48:57 + [2025-09-10 
01:16:59] iteration 5200/ 11920 | consumed samples: 5324800 | elapsed time per iteration (ms): 6032.2 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.130603E+00 | loss scale: 1.0 | grad norm: 0.392 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:36.136322 | finish at 2025-09-10 12:32:35 + [2025-09-10 01:17:04] iteration 5201/ 11920 | consumed samples: 5325824 | elapsed time per iteration (ms): 5652.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.113351E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:59.624300 | finish at 2025-09-10 11:50:04 + [2025-09-10 01:17:10] iteration 5202/ 11920 | consumed samples: 5326848 | elapsed time per iteration (ms): 5644.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124138E+00 | loss scale: 1.0 | grad norm: 0.506 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:00.838667 | finish at 2025-09-10 11:49:11 + [2025-09-10 01:17:16] iteration 5203/ 11920 | consumed samples: 5327872 | elapsed time per iteration (ms): 5892.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.159798E+00 | loss scale: 1.0 | grad norm: 2.370 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:40.727521 | finish at 2025-09-10 12:16:57 + [2025-09-10 01:17:21] iteration 5204/ 11920 | consumed samples: 5328896 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.133065E+00 | loss scale: 1.0 | grad norm: 
0.395 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:44.970741 | finish at 2025-09-10 11:50:06 + [2025-09-10 01:17:28] iteration 5205/ 11920 | consumed samples: 5329920 | elapsed time per iteration (ms): 6237.9 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.107211E+00 | loss scale: 1.0 | grad norm: 0.358 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:38:07.581877 | finish at 2025-09-10 12:55:35 + [2025-09-10 01:17:33] iteration 5206/ 11920 | consumed samples: 5330944 | elapsed time per iteration (ms): 5663.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117330E+00 | loss scale: 1.0 | grad norm: 0.394 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:33:47.449531 | finish at 2025-09-10 11:51:21 + [2025-09-10 01:17:39] iteration 5207/ 11920 | consumed samples: 5331968 | elapsed time per iteration (ms): 5976.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.109819E+00 | loss scale: 1.0 | grad norm: 0.478 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:08:36.787642 | finish at 2025-09-10 12:26:16 + [2025-09-10 01:17:45] iteration 5208/ 11920 | consumed samples: 5332992 | elapsed time per iteration (ms): 5872.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128826E+00 | loss scale: 1.0 | grad norm: 0.786 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:52.947073 | finish at 2025-09-10 12:14:38 + [2025-09-10 01:17:51] iteration 5209/ 11920 | consumed samples: 5334016 | elapsed time per iteration (ms): 
5658.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.146492E+00 | loss scale: 1.0 | grad norm: 4.032 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:54.538811 | finish at 2025-09-10 11:50:45 + [2025-09-10 01:17:56] iteration 5210/ 11920 | consumed samples: 5335040 | elapsed time per iteration (ms): 5642.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137741E+00 | loss scale: 1.0 | grad norm: 0.971 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:31:03.942122 | finish at 2025-09-10 11:49:00 + [2025-09-10 01:18:02] iteration 5211/ 11920 | consumed samples: 5336064 | elapsed time per iteration (ms): 5648.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125058E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:31:37.046715 | finish at 2025-09-10 11:49:39 + [2025-09-10 01:18:08] iteration 5212/ 11920 | consumed samples: 5337088 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.115130E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:31:03.216554 | finish at 2025-09-10 11:49:11 + [2025-09-10 01:18:14] iteration 5213/ 11920 | consumed samples: 5338112 | elapsed time per iteration (ms): 5856.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102766E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:54:36.696949 | finish at 2025-09-10 12:12:50 + [2025-09-10 01:18:20] iteration 5214/ 11920 | consumed samples: 5339136 | elapsed time per iteration (ms): 6001.3 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.097044E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:10:44.739232 | finish at 2025-09-10 12:29:04 + [2025-09-10 01:18:25] iteration 5215/ 11920 | consumed samples: 5340160 | elapsed time per iteration (ms): 5674.3 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084758E+00 | loss scale: 1.0 | grad norm: 0.339 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:06.299497 | finish at 2025-09-10 11:52:32 + [2025-09-10 01:18:31] iteration 5216/ 11920 | consumed samples: 5341184 | elapsed time per iteration (ms): 5659.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093481E+00 | loss scale: 1.0 | grad norm: 0.547 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:20.723000 | finish at 2025-09-10 11:50:52 + [2025-09-10 01:18:37] iteration 5217/ 11920 | consumed samples: 5342208 | elapsed time per iteration (ms): 5660.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.093296E+00 | loss scale: 1.0 | grad norm: 0.511 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:19.396086 | finish at 2025-09-10 11:50:56 + [2025-09-10 01:18:42] iteration 5218/ 11920 | consumed samples: 5343232 | elapsed time per iteration (ms): 5862.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 3.096895E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:53.278962 | finish at 2025-09-10 12:13:36 + [2025-09-10 01:18:49] iteration 5219/ 11920 | consumed samples: 5344256 | elapsed time per iteration (ms): 6024.1 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.091109E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:12:47.173411 | finish at 2025-09-10 12:31:36 + [2025-09-10 01:18:54] iteration 5220/ 11920 | consumed samples: 5345280 | elapsed time per iteration (ms): 5642.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079472E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:30:05.444360 | finish at 2025-09-10 11:49:00 + [2025-09-10 01:19:00] iteration 5221/ 11920 | consumed samples: 5346304 | elapsed time per iteration (ms): 5878.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.086867E+00 | loss scale: 1.0 | grad norm: 0.338 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:19.063586 | finish at 2025-09-10 12:15:19 + [2025-09-10 01:19:06] iteration 5222/ 11920 | consumed samples: 5347328 | elapsed time per iteration (ms): 6262.3 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080639E+00 | loss scale: 1.0 | grad norm: 0.885 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:39:05.069144 | finish at 2025-09-10 12:58:11 + [2025-09-10 01:19:12] iteration 
5223/ 11920 | consumed samples: 5348352 | elapsed time per iteration (ms): 5998.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084020E+00 | loss scale: 1.0 | grad norm: 0.319 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:09:29.834825 | finish at 2025-09-10 12:28:42 + [2025-09-10 01:19:18] iteration 5224/ 11920 | consumed samples: 5349376 | elapsed time per iteration (ms): 5651.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060209E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:30:40.577660 | finish at 2025-09-10 11:49:59 + [2025-09-10 01:19:24] iteration 5225/ 11920 | consumed samples: 5350400 | elapsed time per iteration (ms): 5644.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079005E+00 | loss scale: 1.0 | grad norm: 0.457 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:29:49.131105 | finish at 2025-09-10 11:49:13 + [2025-09-10 01:19:29] iteration 5226/ 11920 | consumed samples: 5351424 | elapsed time per iteration (ms): 5642.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081380E+00 | loss scale: 1.0 | grad norm: 0.585 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:29:32.675596 | finish at 2025-09-10 11:49:02 + [2025-09-10 01:19:35] iteration 5227/ 11920 | consumed samples: 5352448 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.077685E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:29:15.632895 | finish at 2025-09-10 11:48:51 + [2025-09-10 01:19:41] iteration 5228/ 11920 | consumed samples: 5353472 | elapsed time per iteration (ms): 5976.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068905E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:31.129215 | finish at 2025-09-10 12:26:12 + [2025-09-10 01:19:47] iteration 5229/ 11920 | consumed samples: 5354496 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.055750E+00 | loss scale: 1.0 | grad norm: 0.544 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:29:10.876973 | finish at 2025-09-10 11:48:57 + [2025-09-10 01:19:52] iteration 5230/ 11920 | consumed samples: 5355520 | elapsed time per iteration (ms): 5648.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079925E+00 | loss scale: 1.0 | grad norm: 1.824 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:29:50.586147 | finish at 2025-09-10 11:49:43 + [2025-09-10 01:19:58] iteration 5231/ 11920 | consumed samples: 5356544 | elapsed time per iteration (ms): 5970.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072202E+00 | loss scale: 1.0 | grad norm: 0.360 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:05:33.716129 | finish at 2025-09-10 12:25:32 + [2025-09-10 01:20:04] iteration 5232/ 11920 | consumed samples: 5357568 | elapsed time per iteration (ms): 5641.3 | throughput per 
GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.064143E+00 | loss scale: 1.0 | grad norm: 0.391 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:48.991829 | finish at 2025-09-10 11:48:53 + [2025-09-10 01:20:09] iteration 5233/ 11920 | consumed samples: 5358592 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073855E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:30.596092 | finish at 2025-09-10 11:48:40 + [2025-09-10 01:20:15] iteration 5234/ 11920 | consumed samples: 5359616 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.051354E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:24.151699 | finish at 2025-09-10 11:48:39 + [2025-09-10 01:20:21] iteration 5235/ 11920 | consumed samples: 5360640 | elapsed time per iteration (ms): 5900.8 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059581E+00 | loss scale: 1.0 | grad norm: 0.481 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:57:26.557854 | finish at 2025-09-10 12:17:47 + [2025-09-10 01:20:27] iteration 5236/ 11920 | consumed samples: 5361664 | elapsed time per iteration (ms): 5638.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.072038E+00 | loss scale: 1.0 | grad norm: 1.092 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:09.630206 
| finish at 2025-09-10 11:48:36 + [2025-09-10 01:20:32] iteration 5237/ 11920 | consumed samples: 5362688 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.056946E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:56.389543 | finish at 2025-09-10 11:48:29 + [2025-09-10 01:20:38] iteration 5238/ 11920 | consumed samples: 5363712 | elapsed time per iteration (ms): 5859.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.056543E+00 | loss scale: 1.0 | grad norm: 0.331 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:32.401102 | finish at 2025-09-10 12:13:10 + [2025-09-10 01:20:44] iteration 5239/ 11920 | consumed samples: 5364736 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048631E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:15.232322 | finish at 2025-09-10 11:48:59 + [2025-09-10 01:20:49] iteration 5240/ 11920 | consumed samples: 5365760 | elapsed time per iteration (ms): 5642.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050794E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:11.483812 | finish at 2025-09-10 11:49:01 + [2025-09-10 01:20:55] iteration 5241/ 11920 | consumed samples: 5366784 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.049307E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:06.067543 | finish at 2025-09-10 11:48:01 + [2025-09-10 01:21:01] iteration 5242/ 11920 | consumed samples: 5367808 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.054887E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:38.424567 | finish at 2025-09-10 11:48:39 + [2025-09-10 01:21:06] iteration 5243/ 11920 | consumed samples: 5368832 | elapsed time per iteration (ms): 5644.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.050934E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:06.678977 | finish at 2025-09-10 11:49:13 + [2025-09-10 01:21:12] iteration 5244/ 11920 | consumed samples: 5369856 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035729E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:43.924137 | finish at 2025-09-10 11:48:56 + [2025-09-10 01:21:18] iteration 5245/ 11920 | consumed samples: 5370880 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039455E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:44.190856 | finish at 2025-09-10 11:48:02 + [2025-09-10 01:21:23] iteration 5246/ 11920 | consumed samples: 
5371904 | elapsed time per iteration (ms): 5862.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041586E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:07.763132 | finish at 2025-09-10 12:13:31 + [2025-09-10 01:21:29] iteration 5247/ 11920 | consumed samples: 5372928 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.036781E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:42.595172 | finish at 2025-09-10 11:48:12 + [2025-09-10 01:21:35] iteration 5248/ 11920 | consumed samples: 5373952 | elapsed time per iteration (ms): 5645.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.030620E+00 | loss scale: 1.0 | grad norm: 0.558 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:47.265575 | finish at 2025-09-10 11:49:22 + [2025-09-10 01:21:40] iteration 5249/ 11920 | consumed samples: 5374976 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.034662E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:35.210326 | finish at 2025-09-10 11:49:16 + [2025-09-10 01:21:46] iteration 5250/ 11920 | consumed samples: 5376000 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021165E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:22.081783 | finish at 2025-09-10 11:48:08 + [2025-09-10 01:21:52] iteration 5251/ 11920 | consumed samples: 5377024 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012738E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:23.653229 | finish at 2025-09-10 11:48:15 + [2025-09-10 01:21:57] iteration 5252/ 11920 | consumed samples: 5378048 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026979E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:01.660458 | finish at 2025-09-10 11:47:59 + [2025-09-10 01:22:03] iteration 5253/ 11920 | consumed samples: 5379072 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023963E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:45.477580 | finish at 2025-09-10 11:47:48 + [2025-09-10 01:22:09] iteration 5254/ 11920 | consumed samples: 5380096 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007360E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:08.799099 | finish at 2025-09-10 11:47:17 + [2025-09-10 01:22:14] iteration 5255/ 11920 | consumed samples: 5381120 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 
| MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014536E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:52.918013 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:22:20] iteration 5256/ 11920 | consumed samples: 5382144 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023260E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:15.476772 | finish at 2025-09-10 11:47:35 + [2025-09-10 01:22:25] iteration 5257/ 11920 | consumed samples: 5383168 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003308E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:10.172858 | finish at 2025-09-10 11:47:36 + [2025-09-10 01:22:31] iteration 5258/ 11920 | consumed samples: 5384192 | elapsed time per iteration (ms): 5843.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015398E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:48.201891 | finish at 2025-09-10 12:11:19 + [2025-09-10 01:22:37] iteration 5259/ 11920 | consumed samples: 5385216 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.028244E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:44.896989 | finish at 2025-09-10 
11:47:22 + [2025-09-10 01:22:43] iteration 5260/ 11920 | consumed samples: 5386240 | elapsed time per iteration (ms): 5947.2 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009603E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:00:08.134861 | finish at 2025-09-10 12:22:51 + [2025-09-10 01:22:48] iteration 5261/ 11920 | consumed samples: 5387264 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.004576E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:08.100173 | finish at 2025-09-10 11:46:57 + [2025-09-10 01:22:54] iteration 5262/ 11920 | consumed samples: 5388288 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025687E+00 | loss scale: 1.0 | grad norm: 0.452 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:49.567828 | finish at 2025-09-10 11:46:44 + [2025-09-10 01:23:00] iteration 5263/ 11920 | consumed samples: 5389312 | elapsed time per iteration (ms): 5843.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.026648E+00 | loss scale: 1.0 | grad norm: 0.977 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:20.672431 | finish at 2025-09-10 12:11:21 + [2025-09-10 01:23:06] iteration 5264/ 11920 | consumed samples: 5390336 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009742E+00 | loss 
scale: 1.0 | grad norm: 0.296 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:11.985596 | finish at 2025-09-10 11:48:18 + [2025-09-10 01:23:11] iteration 5265/ 11920 | consumed samples: 5391360 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.021400E+00 | loss scale: 1.0 | grad norm: 0.446 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:38.633591 | finish at 2025-09-10 11:46:50 + [2025-09-10 01:23:17] iteration 5266/ 11920 | consumed samples: 5392384 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.039702E+00 | loss scale: 1.0 | grad norm: 0.643 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:37.340343 | finish at 2025-09-10 11:46:54 + [2025-09-10 01:23:22] iteration 5267/ 11920 | consumed samples: 5393408 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032899E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:13.207260 | finish at 2025-09-10 11:47:36 + [2025-09-10 01:23:28] iteration 5268/ 11920 | consumed samples: 5394432 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012414E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:21.756171 | finish at 2025-09-10 11:46:50 + [2025-09-10 01:23:34] iteration 5269/ 11920 | consumed samples: 5395456 | elapsed time 
per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011751E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:30.761823 | finish at 2025-09-10 11:47:04 + [2025-09-10 01:23:39] iteration 5270/ 11920 | consumed samples: 5396480 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007241E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:06.053019 | finish at 2025-09-10 11:48:45 + [2025-09-10 01:23:45] iteration 5271/ 11920 | consumed samples: 5397504 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.013405E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:37.284284 | finish at 2025-09-10 11:48:22 + [2025-09-10 01:23:51] iteration 5272/ 11920 | consumed samples: 5398528 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995821E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:42.333441 | finish at 2025-09-10 11:47:33 + [2025-09-10 01:23:56] iteration 5273/ 11920 | consumed samples: 5399552 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.998208E+00 | loss scale: 1.0 | grad norm: 0.436 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 10:24:06.678637 | finish at 2025-09-10 11:48:03 + [2025-09-10 01:24:02] iteration 5274/ 11920 | consumed samples: 5400576 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025502E+00 | loss scale: 1.0 | grad norm: 0.729 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:50.819853 | finish at 2025-09-10 11:48:53 + [2025-09-10 01:24:07] iteration 5275/ 11920 | consumed samples: 5401600 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007043E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:14.205844 | finish at 2025-09-10 11:48:22 + [2025-09-10 01:24:13] iteration 5276/ 11920 | consumed samples: 5402624 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006298E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:30.004040 | finish at 2025-09-10 11:47:43 + [2025-09-10 01:24:19] iteration 5277/ 11920 | consumed samples: 5403648 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006900E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:21.832956 | finish at 2025-09-10 11:47:41 + [2025-09-10 01:24:24] iteration 5278/ 11920 | consumed samples: 5404672 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.003800E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:41.501789 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:24:30] iteration 5279/ 11920 | consumed samples: 5405696 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.003972E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:24.700555 | finish at 2025-09-10 11:47:55 + [2025-09-10 01:24:36] iteration 5280/ 11920 | consumed samples: 5406720 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.006174E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:28.503971 | finish at 2025-09-10 11:47:04 + [2025-09-10 01:24:41] iteration 5281/ 11920 | consumed samples: 5407744 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.996568E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:51.991102 | finish at 2025-09-10 11:48:33 + [2025-09-10 01:24:47] iteration 5282/ 11920 | consumed samples: 5408768 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990700E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:17.697556 | finish at 2025-09-10 11:47:05 + [2025-09-10 
01:24:53] iteration 5283/ 11920 | consumed samples: 5409792 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995330E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:55.658644 | finish at 2025-09-10 11:46:48 + [2025-09-10 01:24:58] iteration 5284/ 11920 | consumed samples: 5410816 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995215E+00 | loss scale: 1.0 | grad norm: 0.466 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:55.711143 | finish at 2025-09-10 11:47:54 + [2025-09-10 01:25:04] iteration 5285/ 11920 | consumed samples: 5411840 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.007456E+00 | loss scale: 1.0 | grad norm: 0.564 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:59.359928 | finish at 2025-09-10 11:48:03 + [2025-09-10 01:25:09] iteration 5286/ 11920 | consumed samples: 5412864 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989444E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:27.623375 | finish at 2025-09-10 11:46:37 + [2025-09-10 01:25:15] iteration 5287/ 11920 | consumed samples: 5413888 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983652E+00 | loss scale: 1.0 | grad norm: 
0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:27.513976 | finish at 2025-09-10 11:46:43 + [2025-09-10 01:25:21] iteration 5288/ 11920 | consumed samples: 5414912 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997478E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:59.437864 | finish at 2025-09-10 11:47:20 + [2025-09-10 01:25:27] iteration 5289/ 11920 | consumed samples: 5415936 | elapsed time per iteration (ms): 5966.0 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988644E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:59:20.331522 | finish at 2025-09-10 12:24:47 + [2025-09-10 01:25:32] iteration 5290/ 11920 | consumed samples: 5416960 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983309E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:04.102106 | finish at 2025-09-10 11:46:36 + [2025-09-10 01:25:38] iteration 5291/ 11920 | consumed samples: 5417984 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001365E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:56.660508 | finish at 2025-09-10 11:47:35 + [2025-09-10 01:25:43] iteration 5292/ 11920 | consumed samples: 5419008 | elapsed time per iteration (ms): 
5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982970E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:52.075810 | finish at 2025-09-10 11:48:36 + [2025-09-10 01:25:49] iteration 5293/ 11920 | consumed samples: 5420032 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983341E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:03.370741 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:25:55] iteration 5294/ 11920 | consumed samples: 5421056 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987752E+00 | loss scale: 1.0 | grad norm: 0.343 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:21.020825 | finish at 2025-09-10 11:47:16 + [2025-09-10 01:26:00] iteration 5295/ 11920 | consumed samples: 5422080 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989119E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:37.494801 | finish at 2025-09-10 11:48:38 + [2025-09-10 01:26:06] iteration 5296/ 11920 | consumed samples: 5423104 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992183E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:21:45.511826 | finish at 2025-09-10 11:47:52 + [2025-09-10 01:26:12] iteration 5297/ 11920 | consumed samples: 5424128 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002360E+00 | loss scale: 1.0 | grad norm: 0.937 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:01.460181 | finish at 2025-09-10 11:47:13 + [2025-09-10 01:26:17] iteration 5298/ 11920 | consumed samples: 5425152 | elapsed time per iteration (ms): 5832.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.002108E+00 | loss scale: 1.0 | grad norm: 0.553 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:40.734005 | finish at 2025-09-10 12:09:58 + [2025-09-10 01:26:23] iteration 5299/ 11920 | consumed samples: 5426176 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999565E+00 | loss scale: 1.0 | grad norm: 0.387 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:08.058504 | finish at 2025-09-10 11:47:31 + [2025-09-10 01:26:29] iteration 5300/ 11920 | consumed samples: 5427200 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.000060E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:57.197318 | finish at 2025-09-10 11:46:26 + [2025-09-10 01:26:35] iteration 5301/ 11920 | consumed samples: 5428224 | elapsed time per iteration (ms): 5833.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.989974E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:31.760726 | finish at 2025-09-10 12:10:06 + [2025-09-10 01:26:40] iteration 5302/ 11920 | consumed samples: 5429248 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989683E+00 | loss scale: 1.0 | grad norm: 0.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:20.472440 | finish at 2025-09-10 11:49:01 + [2025-09-10 01:26:46] iteration 5303/ 11920 | consumed samples: 5430272 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992016E+00 | loss scale: 1.0 | grad norm: 0.326 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:12.672121 | finish at 2025-09-10 11:47:59 + [2025-09-10 01:26:51] iteration 5304/ 11920 | consumed samples: 5431296 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987564E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:55.380581 | finish at 2025-09-10 11:46:47 + [2025-09-10 01:26:57] iteration 5305/ 11920 | consumed samples: 5432320 | elapsed time per iteration (ms): 5875.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983185E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:47:46.500077 | finish at 2025-09-10 12:14:44 + [2025-09-10 01:27:03] iteration 
5306/ 11920 | consumed samples: 5433344 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.990658E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:20:32.430666 | finish at 2025-09-10 11:47:35 + [2025-09-10 01:27:09] iteration 5307/ 11920 | consumed samples: 5434368 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982184E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:56.901510 | finish at 2025-09-10 11:47:05 + [2025-09-10 01:27:14] iteration 5308/ 11920 | consumed samples: 5435392 | elapsed time per iteration (ms): 5841.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971881E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:46.571013 | finish at 2025-09-10 12:11:01 + [2025-09-10 01:27:20] iteration 5309/ 11920 | consumed samples: 5436416 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972349E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:57.306194 | finish at 2025-09-10 11:47:17 + [2025-09-10 01:27:26] iteration 5310/ 11920 | consumed samples: 5437440 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986174E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 
1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:20:13.796453 | finish at 2025-09-10 11:47:39 + [2025-09-10 01:27:31] iteration 5311/ 11920 | consumed samples: 5438464 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.983474E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:35.352402 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:27:37] iteration 5312/ 11920 | consumed samples: 5439488 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979183E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:54.802624 | finish at 2025-09-10 11:47:32 + [2025-09-10 01:27:43] iteration 5313/ 11920 | consumed samples: 5440512 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963209E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:20:45.335601 | finish at 2025-09-10 11:48:28 + [2025-09-10 01:27:48] iteration 5314/ 11920 | consumed samples: 5441536 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.969435E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:31.780555 | finish at 2025-09-10 11:46:20 + [2025-09-10 01:27:54] iteration 5315/ 11920 | consumed samples: 5442560 | elapsed time per iteration (ms): 5851.1 | throughput per 
GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973212E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:06.726305 | finish at 2025-09-10 12:12:01 + [2025-09-10 01:28:00] iteration 5316/ 11920 | consumed samples: 5443584 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965373E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:59.937691 | finish at 2025-09-10 11:48:00 + [2025-09-10 01:28:05] iteration 5317/ 11920 | consumed samples: 5444608 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968655E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:33.777288 | finish at 2025-09-10 11:46:39 + [2025-09-10 01:28:11] iteration 5318/ 11920 | consumed samples: 5445632 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968877E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:15.603579 | finish at 2025-09-10 11:46:27 + [2025-09-10 01:28:17] iteration 5319/ 11920 | consumed samples: 5446656 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985299E+00 | loss scale: 1.0 | grad norm: 0.399 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:44.057528 
| finish at 2025-09-10 11:47:01 + [2025-09-10 01:28:22] iteration 5320/ 11920 | consumed samples: 5447680 | elapsed time per iteration (ms): 5639.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987102E+00 | loss scale: 1.0 | grad norm: 0.507 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:20:17.528629 | finish at 2025-09-10 11:48:40 + [2025-09-10 01:28:28] iteration 5321/ 11920 | consumed samples: 5448704 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.987887E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:06.472071 | finish at 2025-09-10 11:46:34 + [2025-09-10 01:28:33] iteration 5322/ 11920 | consumed samples: 5449728 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973491E+00 | loss scale: 1.0 | grad norm: 0.314 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:25.229232 | finish at 2025-09-10 11:47:59 + [2025-09-10 01:28:39] iteration 5323/ 11920 | consumed samples: 5450752 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973685E+00 | loss scale: 1.0 | grad norm: 0.329 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:12.913993 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:28:45] iteration 5324/ 11920 | consumed samples: 5451776 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.981920E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:39.370949 | finish at 2025-09-10 11:47:24 + [2025-09-10 01:28:51] iteration 5325/ 11920 | consumed samples: 5452800 | elapsed time per iteration (ms): 5976.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973311E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:53.602593 | finish at 2025-09-10 12:25:44 + [2025-09-10 01:28:56] iteration 5326/ 11920 | consumed samples: 5453824 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972194E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:59.205896 | finish at 2025-09-10 11:46:55 + [2025-09-10 01:29:02] iteration 5327/ 11920 | consumed samples: 5454848 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977522E+00 | loss scale: 1.0 | grad norm: 0.293 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:38.930284 | finish at 2025-09-10 11:47:41 + [2025-09-10 01:29:08] iteration 5328/ 11920 | consumed samples: 5455872 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975558E+00 | loss scale: 1.0 | grad norm: 0.310 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:04.048584 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:29:13] iteration 5329/ 11920 | consumed samples: 
5456896 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980176E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:15.367558 | finish at 2025-09-10 11:47:29 + [2025-09-10 01:29:19] iteration 5330/ 11920 | consumed samples: 5457920 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986045E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:32.631280 | finish at 2025-09-10 11:46:51 + [2025-09-10 01:29:24] iteration 5331/ 11920 | consumed samples: 5458944 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.973546E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:31.979664 | finish at 2025-09-10 11:47:56 + [2025-09-10 01:29:31] iteration 5332/ 11920 | consumed samples: 5459968 | elapsed time per iteration (ms): 6228.8 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975271E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:23:55.333214 | finish at 2025-09-10 12:53:26 + [2025-09-10 01:29:37] iteration 5333/ 11920 | consumed samples: 5460992 | elapsed time per iteration (ms): 5944.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974039E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:35.551262 | finish at 2025-09-10 12:22:12 + [2025-09-10 01:29:42] iteration 5334/ 11920 | consumed samples: 5462016 | elapsed time per iteration (ms): 5889.0 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976676E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:25.032109 | finish at 2025-09-10 12:16:08 + [2025-09-10 01:29:48] iteration 5335/ 11920 | consumed samples: 5463040 | elapsed time per iteration (ms): 5827.5 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962012E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:34.149420 | finish at 2025-09-10 12:09:22 + [2025-09-10 01:29:54] iteration 5336/ 11920 | consumed samples: 5464064 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956900E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:42.715944 | finish at 2025-09-10 11:48:37 + [2025-09-10 01:30:00] iteration 5337/ 11920 | consumed samples: 5465088 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963837E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:59.701322 | finish at 2025-09-10 11:47:59 + [2025-09-10 01:30:05] iteration 5338/ 11920 | consumed samples: 5466112 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971054E+00 | loss scale: 1.0 | grad norm: 0.357 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:37.608593 | finish at 2025-09-10 11:47:43 + [2025-09-10 01:30:11] iteration 5339/ 11920 | consumed samples: 5467136 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967937E+00 | loss scale: 1.0 | grad norm: 0.326 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:18:00.017063 | finish at 2025-09-10 11:48:11 + [2025-09-10 01:30:16] iteration 5340/ 11920 | consumed samples: 5468160 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960698E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:28.938279 | finish at 2025-09-10 11:46:45 + [2025-09-10 01:30:22] iteration 5341/ 11920 | consumed samples: 5469184 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960182E+00 | loss scale: 1.0 | grad norm: 0.327 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:45.821328 | finish at 2025-09-10 11:48:08 + [2025-09-10 01:30:28] iteration 5342/ 11920 | consumed samples: 5470208 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968827E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:20.677500 | finish at 2025-09-10 
11:47:48 + [2025-09-10 01:30:33] iteration 5343/ 11920 | consumed samples: 5471232 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961267E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:58.175544 | finish at 2025-09-10 11:47:32 + [2025-09-10 01:30:39] iteration 5344/ 11920 | consumed samples: 5472256 | elapsed time per iteration (ms): 5971.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952047E+00 | loss scale: 1.0 | grad norm: 0.314 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:54:27.710186 | finish at 2025-09-10 12:25:07 + [2025-09-10 01:30:45] iteration 5345/ 11920 | consumed samples: 5473280 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968842E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:19.937091 | finish at 2025-09-10 11:48:05 + [2025-09-10 01:30:51] iteration 5346/ 11920 | consumed samples: 5474304 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958724E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:15:58.046692 | finish at 2025-09-10 11:46:49 + [2025-09-10 01:30:56] iteration 5347/ 11920 | consumed samples: 5475328 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950212E+00 | loss 
scale: 1.0 | grad norm: 0.307 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:26.121170 | finish at 2025-09-10 11:47:22 + [2025-09-10 01:31:02] iteration 5348/ 11920 | consumed samples: 5476352 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968042E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:29.272694 | finish at 2025-09-10 11:48:31 + [2025-09-10 01:31:08] iteration 5349/ 11920 | consumed samples: 5477376 | elapsed time per iteration (ms): 5842.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956561E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:47.479687 | finish at 2025-09-10 12:10:55 + [2025-09-10 01:31:14] iteration 5350/ 11920 | consumed samples: 5478400 | elapsed time per iteration (ms): 5999.9 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964978E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:56:59.536343 | finish at 2025-09-10 12:28:13 + [2025-09-10 01:31:20] iteration 5351/ 11920 | consumed samples: 5479424 | elapsed time per iteration (ms): 6464.1 | throughput per GPU (TFLOP/s/GPU): 69.8 | MFU 7.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954479E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:47:42.681535 | finish at 2025-09-10 13:19:03 + [2025-09-10 01:31:26] iteration 5352/ 11920 | consumed samples: 5480448 | elapsed time 
per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946590E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:09.832569 | finish at 2025-09-10 11:47:36 + [2025-09-10 01:31:31] iteration 5353/ 11920 | consumed samples: 5481472 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978793E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:15:19.282439 | finish at 2025-09-10 11:46:51 + [2025-09-10 01:31:37] iteration 5354/ 11920 | consumed samples: 5482496 | elapsed time per iteration (ms): 5912.6 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952961E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:47:01.992540 | finish at 2025-09-10 12:18:39 + [2025-09-10 01:31:43] iteration 5355/ 11920 | consumed samples: 5483520 | elapsed time per iteration (ms): 5931.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956641E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:49:02.116008 | finish at 2025-09-10 12:20:45 + [2025-09-10 01:31:49] iteration 5356/ 11920 | consumed samples: 5484544 | elapsed time per iteration (ms): 5954.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954678E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 10:51:24.386230 | finish at 2025-09-10 12:23:14 + [2025-09-10 01:31:55] iteration 5357/ 11920 | consumed samples: 5485568 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950214E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:53.422660 | finish at 2025-09-10 11:46:48 + [2025-09-10 01:32:00] iteration 5358/ 11920 | consumed samples: 5486592 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949905E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:15:15.558640 | finish at 2025-09-10 11:47:16 + [2025-09-10 01:32:06] iteration 5359/ 11920 | consumed samples: 5487616 | elapsed time per iteration (ms): 5997.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959772E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:47.344584 | finish at 2025-09-10 12:27:54 + [2025-09-10 01:32:12] iteration 5360/ 11920 | consumed samples: 5488640 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959504E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:15:07.444763 | finish at 2025-09-10 11:47:20 + [2025-09-10 01:32:18] iteration 5361/ 11920 | consumed samples: 5489664 | elapsed time per iteration (ms): 5834.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.947995E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:51.296114 | finish at 2025-09-10 12:10:09 + [2025-09-10 01:32:24] iteration 5362/ 11920 | consumed samples: 5490688 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945663E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:29.612160 | finish at 2025-09-10 11:46:53 + [2025-09-10 01:32:29] iteration 5363/ 11920 | consumed samples: 5491712 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946987E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:36.551279 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:32:35] iteration 5364/ 11920 | consumed samples: 5492736 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947295E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:21.254991 | finish at 2025-09-10 11:46:56 + [2025-09-10 01:32:41] iteration 5365/ 11920 | consumed samples: 5493760 | elapsed time per iteration (ms): 5837.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953210E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:44.170800 | finish at 2025-09-10 12:10:25 + [2025-09-10 
01:32:46] iteration 5366/ 11920 | consumed samples: 5494784 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946378E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:05.500296 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:32:53] iteration 5367/ 11920 | consumed samples: 5495808 | elapsed time per iteration (ms): 6609.5 | throughput per GPU (TFLOP/s/GPU): 68.3 | MFU 6.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950186E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 12:01:52.079610 | finish at 2025-09-10 13:34:45 + [2025-09-10 01:32:58] iteration 5368/ 11920 | consumed samples: 5496832 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946882E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:49.442179 | finish at 2025-09-10 11:46:48 + [2025-09-10 01:33:04] iteration 5369/ 11920 | consumed samples: 5497856 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964016E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:58.452772 | finish at 2025-09-10 11:47:03 + [2025-09-10 01:33:10] iteration 5370/ 11920 | consumed samples: 5498880 | elapsed time per iteration (ms): 5874.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956565E+00 | loss scale: 1.0 | grad norm: 
0.264 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:41:20.163097 | finish at 2025-09-10 12:14:30 + [2025-09-10 01:33:16] iteration 5371/ 11920 | consumed samples: 5499904 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950721E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:00.095491 | finish at 2025-09-10 11:47:16 + [2025-09-10 01:33:21] iteration 5372/ 11920 | consumed samples: 5500928 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942165E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:26.790986 | finish at 2025-09-10 11:47:48 + [2025-09-10 01:33:27] iteration 5373/ 11920 | consumed samples: 5501952 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955361E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:00.164276 | finish at 2025-09-10 11:46:27 + [2025-09-10 01:33:32] iteration 5374/ 11920 | consumed samples: 5502976 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947037E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:21.675851 | finish at 2025-09-10 11:46:54 + [2025-09-10 01:33:38] iteration 5375/ 11920 | consumed samples: 5504000 | elapsed time per iteration (ms): 
5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957032E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:52.465372 | finish at 2025-09-10 11:47:31 + [2025-09-10 01:33:44] iteration 5376/ 11920 | consumed samples: 5505024 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956422E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:08.247017 | finish at 2025-09-10 11:47:52 + [2025-09-10 01:33:49] iteration 5377/ 11920 | consumed samples: 5506048 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955477E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:24.791513 | finish at 2025-09-10 11:47:14 + [2025-09-10 01:33:55] iteration 5378/ 11920 | consumed samples: 5507072 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947470E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:23.472736 | finish at 2025-09-10 11:48:18 + [2025-09-10 01:34:01] iteration 5379/ 11920 | consumed samples: 5508096 | elapsed time per iteration (ms): 5977.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934631E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:51:39.020628 | finish at 2025-09-10 12:25:40 + [2025-09-10 01:34:07] iteration 5380/ 11920 | consumed samples: 5509120 | elapsed time per iteration (ms): 5984.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942799E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:52:18.514166 | finish at 2025-09-10 12:26:25 + [2025-09-10 01:34:13] iteration 5381/ 11920 | consumed samples: 5510144 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936995E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:11.438432 | finish at 2025-09-10 11:48:24 + [2025-09-10 01:34:18] iteration 5382/ 11920 | consumed samples: 5511168 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948866E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:34.370955 | finish at 2025-09-10 11:46:53 + [2025-09-10 01:34:24] iteration 5383/ 11920 | consumed samples: 5512192 | elapsed time per iteration (ms): 5854.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944087E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:50.347867 | finish at 2025-09-10 12:12:14 + [2025-09-10 01:34:30] iteration 5384/ 11920 | consumed samples: 5513216 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.930978E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:54.490070 | finish at 2025-09-10 11:47:24 + [2025-09-10 01:34:35] iteration 5385/ 11920 | consumed samples: 5514240 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947028E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:44.702026 | finish at 2025-09-10 11:47:20 + [2025-09-10 01:34:41] iteration 5386/ 11920 | consumed samples: 5515264 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937730E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:51.560343 | finish at 2025-09-10 11:48:33 + [2025-09-10 01:34:47] iteration 5387/ 11920 | consumed samples: 5516288 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.957149E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:28.442748 | finish at 2025-09-10 11:47:15 + [2025-09-10 01:34:52] iteration 5388/ 11920 | consumed samples: 5517312 | elapsed time per iteration (ms): 5832.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944066E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:56.306050 | finish at 2025-09-10 12:09:49 + [2025-09-10 01:34:58] iteration 
5389/ 11920 | consumed samples: 5518336 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931662E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:41.981870 | finish at 2025-09-10 11:47:40 + [2025-09-10 01:35:04] iteration 5390/ 11920 | consumed samples: 5519360 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937558E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:10.141506 | finish at 2025-09-10 11:47:14 + [2025-09-10 01:35:09] iteration 5391/ 11920 | consumed samples: 5520384 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936030E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:37.400096 | finish at 2025-09-10 11:46:47 + [2025-09-10 01:35:15] iteration 5392/ 11920 | consumed samples: 5521408 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929936E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:38.560638 | finish at 2025-09-10 11:46:53 + [2025-09-10 01:35:21] iteration 5393/ 11920 | consumed samples: 5522432 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926581E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 
6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:53.179871 | finish at 2025-09-10 11:47:14 + [2025-09-10 01:35:26] iteration 5394/ 11920 | consumed samples: 5523456 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939162E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:35.059463 | finish at 2025-09-10 11:47:01 + [2025-09-10 01:35:32] iteration 5395/ 11920 | consumed samples: 5524480 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955133E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:42.269375 | finish at 2025-09-10 11:47:14 + [2025-09-10 01:35:37] iteration 5396/ 11920 | consumed samples: 5525504 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943638E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:29.730563 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:35:43] iteration 5397/ 11920 | consumed samples: 5526528 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934723E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:33.814340 | finish at 2025-09-10 11:47:17 + [2025-09-10 01:35:49] iteration 5398/ 11920 | consumed samples: 5527552 | elapsed time per iteration (ms): 5625.0 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943976E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:26.494130 | finish at 2025-09-10 11:47:15 + [2025-09-10 01:35:54] iteration 5399/ 11920 | consumed samples: 5528576 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948368E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:15.183454 | finish at 2025-09-10 11:47:09 + [2025-09-10 01:36:00] iteration 5400/ 11920 | consumed samples: 5529600 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930929E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:51.513395 | finish at 2025-09-10 11:47:51 + [2025-09-10 01:36:06] iteration 5401/ 11920 | consumed samples: 5530624 | elapsed time per iteration (ms): 5968.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947216E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 12.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:29.737262 | finish at 2025-09-10 12:24:36 + [2025-09-10 01:36:12] iteration 5402/ 11920 | consumed samples: 5531648 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953265E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
10:11:29.608765 | finish at 2025-09-10 11:47:41 + [2025-09-10 01:36:17] iteration 5403/ 11920 | consumed samples: 5532672 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925446E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:33.343860 | finish at 2025-09-10 11:46:50 + [2025-09-10 01:36:23] iteration 5404/ 11920 | consumed samples: 5533696 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937127E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:50.683917 | finish at 2025-09-10 11:47:13 + [2025-09-10 01:36:28] iteration 5405/ 11920 | consumed samples: 5534720 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940540E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:11.844250 | finish at 2025-09-10 11:47:40 + [2025-09-10 01:36:34] iteration 5406/ 11920 | consumed samples: 5535744 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952593E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:22.389656 | finish at 2025-09-10 11:46:56 + [2025-09-10 01:36:40] iteration 5407/ 11920 | consumed samples: 5536768 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 
1024 | lm loss: 2.935106E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:59.616072 | finish at 2025-09-10 11:47:39 + [2025-09-10 01:36:45] iteration 5408/ 11920 | consumed samples: 5537792 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951242E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:36.429241 | finish at 2025-09-10 11:47:22 + [2025-09-10 01:36:51] iteration 5409/ 11920 | consumed samples: 5538816 | elapsed time per iteration (ms): 5940.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942147E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:35.958205 | finish at 2025-09-10 12:21:27 + [2025-09-10 01:36:57] iteration 5410/ 11920 | consumed samples: 5539840 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934728E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:12.656436 | finish at 2025-09-10 11:47:09 + [2025-09-10 01:37:02] iteration 5411/ 11920 | consumed samples: 5540864 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937861E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:51.963564 | finish at 2025-09-10 11:47:54 + [2025-09-10 01:37:08] iteration 5412/ 11920 | 
consumed samples: 5541888 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941544E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:03.732647 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:37:14] iteration 5413/ 11920 | consumed samples: 5542912 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936715E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:21.206871 | finish at 2025-09-10 11:46:35 + [2025-09-10 01:37:19] iteration 5414/ 11920 | consumed samples: 5543936 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950611E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:00.414841 | finish at 2025-09-10 11:47:20 + [2025-09-10 01:37:25] iteration 5415/ 11920 | consumed samples: 5544960 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946073E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:46.594177 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:37:31] iteration 5416/ 11920 | consumed samples: 5545984 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926302E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:58.306091 | finish at 2025-09-10 11:46:29 + [2025-09-10 01:37:36] iteration 5417/ 11920 | consumed samples: 5547008 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928620E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:58.494653 | finish at 2025-09-10 11:46:35 + [2025-09-10 01:37:42] iteration 5418/ 11920 | consumed samples: 5548032 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933320E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:19.542837 | finish at 2025-09-10 11:48:01 + [2025-09-10 01:37:47] iteration 5419/ 11920 | consumed samples: 5549056 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931001E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:24.474846 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:37:53] iteration 5420/ 11920 | consumed samples: 5550080 | elapsed time per iteration (ms): 6005.4 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937458E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:50:34.947753 | finish at 2025-09-10 12:28:28 + [2025-09-10 01:37:59] iteration 5421/ 11920 | consumed samples: 5551104 | elapsed time per iteration (ms): 5628.7 | throughput per GPU 
(TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931035E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:40.806755 | finish at 2025-09-10 11:47:40 + [2025-09-10 01:38:05] iteration 5422/ 11920 | consumed samples: 5552128 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946200E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:35.119201 | finish at 2025-09-10 11:47:40 + [2025-09-10 01:38:10] iteration 5423/ 11920 | consumed samples: 5553152 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935518E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:38.456992 | finish at 2025-09-10 11:46:49 + [2025-09-10 01:38:16] iteration 5424/ 11920 | consumed samples: 5554176 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933550E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:19.219414 | finish at 2025-09-10 11:46:35 + [2025-09-10 01:38:22] iteration 5425/ 11920 | consumed samples: 5555200 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932308E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:40.404174 | 
finish at 2025-09-10 11:47:02 + [2025-09-10 01:38:27] iteration 5426/ 11920 | consumed samples: 5556224 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916265E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:42.223300 | finish at 2025-09-10 11:48:09 + [2025-09-10 01:38:33] iteration 5427/ 11920 | consumed samples: 5557248 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926558E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:48.694890 | finish at 2025-09-10 11:47:22 + [2025-09-10 01:38:38] iteration 5428/ 11920 | consumed samples: 5558272 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936760E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:21.958405 | finish at 2025-09-10 11:47:00 + [2025-09-10 01:38:44] iteration 5429/ 11920 | consumed samples: 5559296 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933535E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:10.732030 | finish at 2025-09-10 11:46:55 + [2025-09-10 01:38:50] iteration 5430/ 11920 | consumed samples: 5560320 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.928622E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:49.519324 | finish at 2025-09-10 11:46:39 + [2025-09-10 01:38:55] iteration 5431/ 11920 | consumed samples: 5561344 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942393E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:36.473913 | finish at 2025-09-10 11:46:32 + [2025-09-10 01:39:01] iteration 5432/ 11920 | consumed samples: 5562368 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931910E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:59.379406 | finish at 2025-09-10 11:48:00 + [2025-09-10 01:39:07] iteration 5433/ 11920 | consumed samples: 5563392 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940960E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:46.736731 | finish at 2025-09-10 11:47:53 + [2025-09-10 01:39:12] iteration 5434/ 11920 | consumed samples: 5564416 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933425E+00 | loss scale: 1.0 | grad norm: 0.311 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:52.846111 | finish at 2025-09-10 11:48:05 + [2025-09-10 01:39:18] iteration 5435/ 11920 | consumed samples: 
5565440 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941507E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:48.549727 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:39:23] iteration 5436/ 11920 | consumed samples: 5566464 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954063E+00 | loss scale: 1.0 | grad norm: 0.359 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:31.925536 | finish at 2025-09-10 11:46:55 + [2025-09-10 01:39:29] iteration 5437/ 11920 | consumed samples: 5567488 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941198E+00 | loss scale: 1.0 | grad norm: 0.485 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:23.443678 | finish at 2025-09-10 11:48:53 + [2025-09-10 01:39:35] iteration 5438/ 11920 | consumed samples: 5568512 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942508E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:06.227227 | finish at 2025-09-10 11:47:41 + [2025-09-10 01:39:40] iteration 5439/ 11920 | consumed samples: 5569536 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947213E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 5.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:15.188066 | finish at 2025-09-10 11:47:56 + [2025-09-10 01:39:46] iteration 5440/ 11920 | consumed samples: 5570560 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928130E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:04.968452 | finish at 2025-09-10 11:47:51 + [2025-09-10 01:39:52] iteration 5441/ 11920 | consumed samples: 5571584 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927390E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:34.009381 | finish at 2025-09-10 11:47:26 + [2025-09-10 01:39:57] iteration 5442/ 11920 | consumed samples: 5572608 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930807E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:34.879172 | finish at 2025-09-10 11:46:32 + [2025-09-10 01:40:03] iteration 5443/ 11920 | consumed samples: 5573632 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948527E+00 | loss scale: 1.0 | grad norm: 0.498 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:32.254799 | finish at 2025-09-10 11:48:35 + [2025-09-10 01:40:09] iteration 5444/ 11920 | consumed samples: 5574656 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 
| MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945246E+00 | loss scale: 1.0 | grad norm: 1.845 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:29.977324 | finish at 2025-09-10 11:48:38 + [2025-09-10 01:40:14] iteration 5445/ 11920 | consumed samples: 5575680 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963698E+00 | loss scale: 1.0 | grad norm: 0.344 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:52.283617 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:40:20] iteration 5446/ 11920 | consumed samples: 5576704 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986412E+00 | loss scale: 1.0 | grad norm: 0.363 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:25.620134 | finish at 2025-09-10 11:47:45 + [2025-09-10 01:40:25] iteration 5447/ 11920 | consumed samples: 5577728 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952158E+00 | loss scale: 1.0 | grad norm: 0.435 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:42.185875 | finish at 2025-09-10 11:49:08 + [2025-09-10 01:40:31] iteration 5448/ 11920 | consumed samples: 5578752 | elapsed time per iteration (ms): 5643.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.954195E+00 | loss scale: 1.0 | grad norm: 0.668 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:45.280361 | finish at 2025-09-10 
11:49:16 + [2025-09-10 01:40:37] iteration 5449/ 11920 | consumed samples: 5579776 | elapsed time per iteration (ms): 5644.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960803E+00 | loss scale: 1.0 | grad norm: 0.701 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:43.066436 | finish at 2025-09-10 11:49:20 + [2025-09-10 01:40:42] iteration 5450/ 11920 | consumed samples: 5580800 | elapsed time per iteration (ms): 5651.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953351E+00 | loss scale: 1.0 | grad norm: 0.449 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:22.844784 | finish at 2025-09-10 11:50:05 + [2025-09-10 01:40:48] iteration 5451/ 11920 | consumed samples: 5581824 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948755E+00 | loss scale: 1.0 | grad norm: 0.378 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:35.475744 | finish at 2025-09-10 11:47:23 + [2025-09-10 01:40:54] iteration 5452/ 11920 | consumed samples: 5582848 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946205E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:34.972435 | finish at 2025-09-10 11:47:29 + [2025-09-10 01:40:59] iteration 5453/ 11920 | consumed samples: 5583872 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948275E+00 | loss 
scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:12.821134 | finish at 2025-09-10 11:48:12 + [2025-09-10 01:41:05] iteration 5454/ 11920 | consumed samples: 5584896 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952943E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:25.554641 | finish at 2025-09-10 11:47:30 + [2025-09-10 01:41:10] iteration 5455/ 11920 | consumed samples: 5585920 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.955342E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:36.477578 | finish at 2025-09-10 11:46:47 + [2025-09-10 01:41:16] iteration 5456/ 11920 | consumed samples: 5586944 | elapsed time per iteration (ms): 5828.3 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943692E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:54.366989 | finish at 2025-09-10 12:09:11 + [2025-09-10 01:41:22] iteration 5457/ 11920 | consumed samples: 5587968 | elapsed time per iteration (ms): 5873.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941000E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:41.805685 | finish at 2025-09-10 12:14:04 + [2025-09-10 01:41:28] iteration 5458/ 11920 | consumed samples: 5588992 | elapsed time 
per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.940836E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:41.101656 | finish at 2025-09-10 11:48:09 + [2025-09-10 01:41:33] iteration 5459/ 11920 | consumed samples: 5590016 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933901E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:47.761672 | finish at 2025-09-10 11:47:21 + [2025-09-10 01:41:39] iteration 5460/ 11920 | consumed samples: 5591040 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930874E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:16.959515 | finish at 2025-09-10 11:47:56 + [2025-09-10 01:41:45] iteration 5461/ 11920 | consumed samples: 5592064 | elapsed time per iteration (ms): 5640.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918594E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:07:10.183587 | finish at 2025-09-10 11:48:55 + [2025-09-10 01:41:50] iteration 5462/ 11920 | consumed samples: 5593088 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939093E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 10:06:07.625011 | finish at 2025-09-10 11:47:58 + [2025-09-10 01:41:56] iteration 5463/ 11920 | consumed samples: 5594112 | elapsed time per iteration (ms): 5955.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925529E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:55.517179 | finish at 2025-09-10 12:22:52 + [2025-09-10 01:42:02] iteration 5464/ 11920 | consumed samples: 5595136 | elapsed time per iteration (ms): 5960.3 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941718E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:41:19.852060 | finish at 2025-09-10 12:23:22 + [2025-09-10 01:42:08] iteration 5465/ 11920 | consumed samples: 5596160 | elapsed time per iteration (ms): 5913.3 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932576E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:36:10.154750 | finish at 2025-09-10 12:18:18 + [2025-09-10 01:42:14] iteration 5466/ 11920 | consumed samples: 5597184 | elapsed time per iteration (ms): 5828.9 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907027E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:00.041298 | finish at 2025-09-10 12:09:14 + [2025-09-10 01:42:20] iteration 5467/ 11920 | consumed samples: 5598208 | elapsed time per iteration (ms): 6276.3 | throughput per GPU (TFLOP/s/GPU): 71.9 | MFU 7.27% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.923144E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:15:00.923598 | finish at 2025-09-10 12:57:21 + [2025-09-10 01:42:26] iteration 5468/ 11920 | consumed samples: 5599232 | elapsed time per iteration (ms): 6032.5 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.941694E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:48:41.585582 | finish at 2025-09-10 12:31:08 + [2025-09-10 01:42:32] iteration 5469/ 11920 | consumed samples: 5600256 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912457E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:37.653333 | finish at 2025-09-10 11:48:10 + [2025-09-10 01:42:38] iteration 5470/ 11920 | consumed samples: 5601280 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929654E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:29.121709 | finish at 2025-09-10 11:48:07 + [2025-09-10 01:42:43] iteration 5471/ 11920 | consumed samples: 5602304 | elapsed time per iteration (ms): 5901.0 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920352E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:15.730580 | finish at 2025-09-10 12:16:59 + [2025-09-10 
01:42:49] iteration 5472/ 11920 | consumed samples: 5603328 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939067E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:12.881908 | finish at 2025-09-10 11:47:02 + [2025-09-10 01:42:55] iteration 5473/ 11920 | consumed samples: 5604352 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913437E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:11.591164 | finish at 2025-09-10 11:48:06 + [2025-09-10 01:43:01] iteration 5474/ 11920 | consumed samples: 5605376 | elapsed time per iteration (ms): 5831.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943183E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:27.239779 | finish at 2025-09-10 12:09:28 + [2025-09-10 01:43:07] iteration 5475/ 11920 | consumed samples: 5606400 | elapsed time per iteration (ms): 6200.2 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936353E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:06:00.331010 | finish at 2025-09-10 12:49:07 + [2025-09-10 01:43:12] iteration 5476/ 11920 | consumed samples: 5607424 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939903E+00 | loss scale: 1.0 | grad norm: 
0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:43.415706 | finish at 2025-09-10 11:47:56 + [2025-09-10 01:43:18] iteration 5477/ 11920 | consumed samples: 5608448 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929307E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:03:48.867043 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:43:24] iteration 5478/ 11920 | consumed samples: 5609472 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928272E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:03:27.167876 | finish at 2025-09-10 11:46:51 + [2025-09-10 01:43:29] iteration 5479/ 11920 | consumed samples: 5610496 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930329E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:03:15.777938 | finish at 2025-09-10 11:46:45 + [2025-09-10 01:43:35] iteration 5480/ 11920 | consumed samples: 5611520 | elapsed time per iteration (ms): 6020.2 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925801E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:46:09.784079 | finish at 2025-09-10 12:29:45 + [2025-09-10 01:43:41] iteration 5481/ 11920 | consumed samples: 5612544 | elapsed time per iteration (ms): 
5996.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934631E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:43:31.471274 | finish at 2025-09-10 12:27:13 + [2025-09-10 01:43:47] iteration 5482/ 11920 | consumed samples: 5613568 | elapsed time per iteration (ms): 5921.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914709E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:35:22.810194 | finish at 2025-09-10 12:19:10 + [2025-09-10 01:43:53] iteration 5483/ 11920 | consumed samples: 5614592 | elapsed time per iteration (ms): 5824.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926675E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:51.843561 | finish at 2025-09-10 12:08:45 + [2025-09-10 01:43:59] iteration 5484/ 11920 | consumed samples: 5615616 | elapsed time per iteration (ms): 5919.1 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927325E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:55.100797 | finish at 2025-09-10 12:18:54 + [2025-09-10 01:44:05] iteration 5485/ 11920 | consumed samples: 5616640 | elapsed time per iteration (ms): 5979.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936314E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 10:41:15.818342 | finish at 2025-09-10 12:25:21 + [2025-09-10 01:44:11] iteration 5486/ 11920 | consumed samples: 5617664 | elapsed time per iteration (ms): 5954.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929446E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:38:28.052849 | finish at 2025-09-10 12:22:39 + [2025-09-10 01:44:17] iteration 5487/ 11920 | consumed samples: 5618688 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934724E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:59.954738 | finish at 2025-09-10 11:47:16 + [2025-09-10 01:44:22] iteration 5488/ 11920 | consumed samples: 5619712 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932251E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:13.964172 | finish at 2025-09-10 11:48:36 + [2025-09-10 01:44:28] iteration 5489/ 11920 | consumed samples: 5620736 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924769E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:45.675227 | finish at 2025-09-10 11:47:13 + [2025-09-10 01:44:33] iteration 5490/ 11920 | consumed samples: 5621760 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.926363E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:03:32.243636 | finish at 2025-09-10 11:48:06 + [2025-09-10 01:44:39] iteration 5491/ 11920 | consumed samples: 5622784 | elapsed time per iteration (ms): 5981.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927424E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:40:51.618738 | finish at 2025-09-10 12:25:31 + [2025-09-10 01:44:45] iteration 5492/ 11920 | consumed samples: 5623808 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922510E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:00.097032 | finish at 2025-09-10 11:48:45 + [2025-09-10 01:44:51] iteration 5493/ 11920 | consumed samples: 5624832 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911573E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:50.442702 | finish at 2025-09-10 11:46:41 + [2025-09-10 01:44:56] iteration 5494/ 11920 | consumed samples: 5625856 | elapsed time per iteration (ms): 5839.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908311E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:24.121096 | finish at 2025-09-10 12:10:21 + [2025-09-10 01:45:02] iteration 
5495/ 11920 | consumed samples: 5626880 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930223E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:29.210960 | finish at 2025-09-10 11:47:31 + [2025-09-10 01:45:08] iteration 5496/ 11920 | consumed samples: 5627904 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931191E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:18.380243 | finish at 2025-09-10 11:47:26 + [2025-09-10 01:45:13] iteration 5497/ 11920 | consumed samples: 5628928 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913947E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:30.605810 | finish at 2025-09-10 11:47:44 + [2025-09-10 01:45:19] iteration 5498/ 11920 | consumed samples: 5629952 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927299E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:21.723705 | finish at 2025-09-10 11:46:41 + [2025-09-10 01:45:25] iteration 5499/ 11920 | consumed samples: 5630976 | elapsed time per iteration (ms): 5841.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910415E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 
7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:09.130609 | finish at 2025-09-10 12:10:34 + [2025-09-10 01:45:30] iteration 5500/ 11920 | consumed samples: 5632000 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912678E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:46.659050 | finish at 2025-09-10 11:47:17 + [2025-09-10 01:45:36] iteration 5501/ 11920 | consumed samples: 5633024 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924638E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:04.918520 | finish at 2025-09-10 11:47:41 + [2025-09-10 01:45:42] iteration 5502/ 11920 | consumed samples: 5634048 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920477E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:36.033162 | finish at 2025-09-10 11:48:18 + [2025-09-10 01:45:47] iteration 5503/ 11920 | consumed samples: 5635072 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924786E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:00:48.011986 | finish at 2025-09-10 11:46:35 + [2025-09-10 01:45:53] iteration 5504/ 11920 | consumed samples: 5636096 | elapsed time per iteration (ms): 5620.1 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914559E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:00:58.734592 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:45:59] iteration 5505/ 11920 | consumed samples: 5637120 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921522E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:33.158661 | finish at 2025-09-10 11:47:32 + [2025-09-10 01:46:04] iteration 5506/ 11920 | consumed samples: 5638144 | elapsed time per iteration (ms): 5867.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930822E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:15.024728 | finish at 2025-09-10 12:13:19 + [2025-09-10 01:46:10] iteration 5507/ 11920 | consumed samples: 5639168 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950009E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:00:43.025532 | finish at 2025-09-10 11:46:53 + [2025-09-10 01:46:16] iteration 5508/ 11920 | consumed samples: 5640192 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928545E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:34.728387 
| finish at 2025-09-10 11:47:50 + [2025-09-10 01:46:22] iteration 5509/ 11920 | consumed samples: 5641216 | elapsed time per iteration (ms): 5840.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918287E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:00.635332 | finish at 2025-09-10 12:10:22 + [2025-09-10 01:46:27] iteration 5510/ 11920 | consumed samples: 5642240 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933667E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:02.978551 | finish at 2025-09-10 11:48:30 + [2025-09-10 01:46:33] iteration 5511/ 11920 | consumed samples: 5643264 | elapsed time per iteration (ms): 5947.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924140E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:35:17.878885 | finish at 2025-09-10 12:21:51 + [2025-09-10 01:46:39] iteration 5512/ 11920 | consumed samples: 5644288 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933089E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:36.165562 | finish at 2025-09-10 11:48:15 + [2025-09-10 01:46:45] iteration 5513/ 11920 | consumed samples: 5645312 | elapsed time per iteration (ms): 5924.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.920890E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:32:41.152382 | finish at 2025-09-10 12:19:26 + [2025-09-10 01:46:50] iteration 5514/ 11920 | consumed samples: 5646336 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928736E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:00:17.080945 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:46:56] iteration 5515/ 11920 | consumed samples: 5647360 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915087E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:00:11.394410 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:47:02] iteration 5516/ 11920 | consumed samples: 5648384 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921672E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:26.225532 | finish at 2025-09-10 11:46:28 + [2025-09-10 01:47:07] iteration 5517/ 11920 | consumed samples: 5649408 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921527E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:57.488781 | finish at 2025-09-10 11:47:05 + [2025-09-10 01:47:13] iteration 5518/ 11920 | consumed samples: 
5650432 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921333E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:59.226896 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:47:18] iteration 5519/ 11920 | consumed samples: 5651456 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911882E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:00:12.579517 | finish at 2025-09-10 11:47:31 + [2025-09-10 01:47:24] iteration 5520/ 11920 | consumed samples: 5652480 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906604E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:31.305847 | finish at 2025-09-10 11:46:55 + [2025-09-10 01:47:30] iteration 5521/ 11920 | consumed samples: 5653504 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919721E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:11.113938 | finish at 2025-09-10 11:46:41 + [2025-09-10 01:47:35] iteration 5522/ 11920 | consumed samples: 5654528 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919374E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | remaining time: 10:00:12.668304 | finish at 2025-09-10 11:47:48 + [2025-09-10 01:47:41] iteration 5523/ 11920 | consumed samples: 5655552 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920751E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:02.197109 | finish at 2025-09-10 11:48:43 + [2025-09-10 01:47:47] iteration 5524/ 11920 | consumed samples: 5656576 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916555E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:55.735056 | finish at 2025-09-10 11:47:42 + [2025-09-10 01:47:52] iteration 5525/ 11920 | consumed samples: 5657600 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927579E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:04.046416 | finish at 2025-09-10 11:46:56 + [2025-09-10 01:47:58] iteration 5526/ 11920 | consumed samples: 5658624 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919163E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:39.912867 | finish at 2025-09-10 11:46:38 + [2025-09-10 01:48:04] iteration 5527/ 11920 | consumed samples: 5659648 | elapsed time per iteration (ms): 5850.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | 
learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908772E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:19.157609 | finish at 2025-09-10 12:11:23 + [2025-09-10 01:48:09] iteration 5528/ 11920 | consumed samples: 5660672 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908815E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:50.343664 | finish at 2025-09-10 11:47:00 + [2025-09-10 01:48:15] iteration 5529/ 11920 | consumed samples: 5661696 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921701E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:13.783160 | finish at 2025-09-10 11:47:29 + [2025-09-10 01:48:20] iteration 5530/ 11920 | consumed samples: 5662720 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911757E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:27.044442 | finish at 2025-09-10 11:46:48 + [2025-09-10 01:48:26] iteration 5531/ 11920 | consumed samples: 5663744 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927822E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:20.962116 | finish at 2025-09-10 11:46:47 + 
[2025-09-10 01:48:32] iteration 5532/ 11920 | consumed samples: 5664768 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911143E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:14.555532 | finish at 2025-09-10 11:46:46 + [2025-09-10 01:48:37] iteration 5533/ 11920 | consumed samples: 5665792 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927725E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:17.985384 | finish at 2025-09-10 11:47:55 + [2025-09-10 01:48:43] iteration 5534/ 11920 | consumed samples: 5666816 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920595E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:09.015059 | finish at 2025-09-10 11:47:52 + [2025-09-10 01:48:49] iteration 5535/ 11920 | consumed samples: 5667840 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923967E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:07.753197 | finish at 2025-09-10 11:47:56 + [2025-09-10 01:48:54] iteration 5536/ 11920 | consumed samples: 5668864 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924745E+00 | loss scale: 1.0 | grad 
norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:23.427727 | finish at 2025-09-10 11:47:18 + [2025-09-10 01:49:00] iteration 5537/ 11920 | consumed samples: 5669888 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914857E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:41.695396 | finish at 2025-09-10 11:46:42 + [2025-09-10 01:49:05] iteration 5538/ 11920 | consumed samples: 5670912 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913347E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:04.699662 | finish at 2025-09-10 11:47:10 + [2025-09-10 01:49:11] iteration 5539/ 11920 | consumed samples: 5671936 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927406E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:14.950137 | finish at 2025-09-10 11:46:26 + [2025-09-10 01:49:17] iteration 5540/ 11920 | consumed samples: 5672960 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917274E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:46.114707 | finish at 2025-09-10 11:47:03 + [2025-09-10 01:49:23] iteration 5541/ 11920 | consumed samples: 5673984 | elapsed time per iteration (ms): 
6002.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922556E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:38:11.698389 | finish at 2025-09-10 12:27:34 + [2025-09-10 01:49:28] iteration 5542/ 11920 | consumed samples: 5675008 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926014E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:40.433889 | finish at 2025-09-10 11:47:09 + [2025-09-10 01:49:34] iteration 5543/ 11920 | consumed samples: 5676032 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932307E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:43.644224 | finish at 2025-09-10 11:48:18 + [2025-09-10 01:49:40] iteration 5544/ 11920 | consumed samples: 5677056 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935921E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:16.639025 | finish at 2025-09-10 11:47:56 + [2025-09-10 01:49:45] iteration 5545/ 11920 | consumed samples: 5678080 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926245E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 9:58:37.925298 | finish at 2025-09-10 11:48:23 + [2025-09-10 01:49:51] iteration 5546/ 11920 | consumed samples: 5679104 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928894E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:00.148355 | finish at 2025-09-10 11:46:51 + [2025-09-10 01:49:56] iteration 5547/ 11920 | consumed samples: 5680128 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923042E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:55.599833 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:50:02] iteration 5548/ 11920 | consumed samples: 5681152 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924239E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:25.053781 | finish at 2025-09-10 11:47:27 + [2025-09-10 01:50:08] iteration 5549/ 11920 | consumed samples: 5682176 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921244E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:58.199328 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:50:13] iteration 5550/ 11920 | consumed samples: 5683200 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 2.923589E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:51.828527 | finish at 2025-09-10 11:47:05 + [2025-09-10 01:50:19] iteration 5551/ 11920 | consumed samples: 5684224 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917678E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:22.632065 | finish at 2025-09-10 11:47:42 + [2025-09-10 01:50:25] iteration 5552/ 11920 | consumed samples: 5685248 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932108E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:38.686813 | finish at 2025-09-10 11:47:03 + [2025-09-10 01:50:30] iteration 5553/ 11920 | consumed samples: 5686272 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906089E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:37.054493 | finish at 2025-09-10 11:47:07 + [2025-09-10 01:50:36] iteration 5554/ 11920 | consumed samples: 5687296 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920746E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:56.643007 | finish at 2025-09-10 11:48:32 + [2025-09-10 01:50:42] iteration 5555/ 11920 | 
consumed samples: 5688320 | elapsed time per iteration (ms): 6007.3 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928153E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:16.239269 | finish at 2025-09-10 12:27:58 + [2025-09-10 01:50:47] iteration 5556/ 11920 | consumed samples: 5689344 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920282E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:06.372622 | finish at 2025-09-10 11:47:54 + [2025-09-10 01:50:53] iteration 5557/ 11920 | consumed samples: 5690368 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931935E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:10.578548 | finish at 2025-09-10 11:47:04 + [2025-09-10 01:50:59] iteration 5558/ 11920 | consumed samples: 5691392 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911850E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:53.562551 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:51:04] iteration 5559/ 11920 | consumed samples: 5692416 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915193E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:03.499772 | finish at 2025-09-10 11:47:08 + [2025-09-10 01:51:10] iteration 5560/ 11920 | consumed samples: 5693440 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911607E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:42.049885 | finish at 2025-09-10 11:46:52 + [2025-09-10 01:51:16] iteration 5561/ 11920 | consumed samples: 5694464 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909885E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:42.194292 | finish at 2025-09-10 11:46:58 + [2025-09-10 01:51:21] iteration 5562/ 11920 | consumed samples: 5695488 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907686E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:08.139946 | finish at 2025-09-10 11:47:29 + [2025-09-10 01:51:27] iteration 5563/ 11920 | consumed samples: 5696512 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912441E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:11.450392 | finish at 2025-09-10 11:47:38 + [2025-09-10 01:51:33] iteration 5564/ 11920 | consumed samples: 5697536 | elapsed time per iteration (ms): 5958.8 | throughput per GPU (TFLOP/s/GPU): 
75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902791E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:31:13.892345 | finish at 2025-09-10 12:22:47 + [2025-09-10 01:51:38] iteration 5565/ 11920 | consumed samples: 5698560 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897646E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:00.614381 | finish at 2025-09-10 11:47:39 + [2025-09-10 01:51:44] iteration 5566/ 11920 | consumed samples: 5699584 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930605E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:14.710263 | finish at 2025-09-10 11:46:59 + [2025-09-10 01:51:50] iteration 5567/ 11920 | consumed samples: 5700608 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920460E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:38.272649 | finish at 2025-09-10 11:47:28 + [2025-09-10 01:51:56] iteration 5568/ 11920 | consumed samples: 5701632 | elapsed time per iteration (ms): 5819.5 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917381E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:05.645538 | finish at 
2025-09-10 12:08:01 + [2025-09-10 01:52:01] iteration 5569/ 11920 | consumed samples: 5702656 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914563E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:13.474300 | finish at 2025-09-10 11:47:15 + [2025-09-10 01:52:07] iteration 5570/ 11920 | consumed samples: 5703680 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907045E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:04.859436 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:52:13] iteration 5571/ 11920 | consumed samples: 5704704 | elapsed time per iteration (ms): 5994.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905134E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:34:21.353610 | finish at 2025-09-10 12:26:34 + [2025-09-10 01:52:18] iteration 5572/ 11920 | consumed samples: 5705728 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912501E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:29.478773 | finish at 2025-09-10 11:47:48 + [2025-09-10 01:52:24] iteration 5573/ 11920 | consumed samples: 5706752 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912902E+00 | 
loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:30.726412 | finish at 2025-09-10 11:46:55 + [2025-09-10 01:52:30] iteration 5574/ 11920 | consumed samples: 5707776 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909433E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:53.202809 | finish at 2025-09-10 11:47:23 + [2025-09-10 01:52:36] iteration 5575/ 11920 | consumed samples: 5708800 | elapsed time per iteration (ms): 5951.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913414E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:29:19.170481 | finish at 2025-09-10 12:21:55 + [2025-09-10 01:52:41] iteration 5576/ 11920 | consumed samples: 5709824 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916048E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:45.883316 | finish at 2025-09-10 11:47:27 + [2025-09-10 01:52:47] iteration 5577/ 11920 | consumed samples: 5710848 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910167E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:33.247205 | finish at 2025-09-10 11:47:20 + [2025-09-10 01:52:52] iteration 5578/ 11920 | consumed samples: 5711872 | elapsed time 
per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921061E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:31.611641 | finish at 2025-09-10 11:48:24 + [2025-09-10 01:52:58] iteration 5579/ 11920 | consumed samples: 5712896 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905811E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:37.197385 | finish at 2025-09-10 11:47:35 + [2025-09-10 01:53:04] iteration 5580/ 11920 | consumed samples: 5713920 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936631E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:30.208616 | finish at 2025-09-10 11:46:34 + [2025-09-10 01:53:09] iteration 5581/ 11920 | consumed samples: 5714944 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904370E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:12.865427 | finish at 2025-09-10 11:47:22 + [2025-09-10 01:53:15] iteration 5582/ 11920 | consumed samples: 5715968 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895833E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 9:55:16.990273 | finish at 2025-09-10 11:48:32 + [2025-09-10 01:53:21] iteration 5583/ 11920 | consumed samples: 5716992 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920647E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:44.386862 | finish at 2025-09-10 11:47:05 + [2025-09-10 01:53:26] iteration 5584/ 11920 | consumed samples: 5718016 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916041E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:55.324631 | finish at 2025-09-10 11:47:22 + [2025-09-10 01:53:32] iteration 5585/ 11920 | consumed samples: 5719040 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910395E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:41.621337 | finish at 2025-09-10 11:47:13 + [2025-09-10 01:53:38] iteration 5586/ 11920 | consumed samples: 5720064 | elapsed time per iteration (ms): 5997.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926303E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:33:10.946321 | finish at 2025-09-10 12:26:49 + [2025-09-10 01:53:43] iteration 5587/ 11920 | consumed samples: 5721088 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.898768E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:44.458246 | finish at 2025-09-10 11:47:28 + [2025-09-10 01:53:49] iteration 5588/ 11920 | consumed samples: 5722112 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904795E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:46.952022 | finish at 2025-09-10 11:47:36 + [2025-09-10 01:53:55] iteration 5589/ 11920 | consumed samples: 5723136 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912307E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:22.858950 | finish at 2025-09-10 11:48:18 + [2025-09-10 01:54:00] iteration 5590/ 11920 | consumed samples: 5724160 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907995E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:14.897876 | finish at 2025-09-10 11:47:15 + [2025-09-10 01:54:06] iteration 5591/ 11920 | consumed samples: 5725184 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915643E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:25.497403 | finish at 2025-09-10 11:47:31 + [2025-09-10 01:54:12] 
iteration 5592/ 11920 | consumed samples: 5726208 | elapsed time per iteration (ms): 6365.8 | throughput per GPU (TFLOP/s/GPU): 70.9 | MFU 7.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912446E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:11:22.555485 | finish at 2025-09-10 13:05:35 + [2025-09-10 01:54:18] iteration 5593/ 11920 | consumed samples: 5727232 | elapsed time per iteration (ms): 5881.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922879E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:20:11.867384 | finish at 2025-09-10 12:14:30 + [2025-09-10 01:54:24] iteration 5594/ 11920 | consumed samples: 5728256 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899185E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:53:21.587906 | finish at 2025-09-10 11:47:45 + [2025-09-10 01:54:29] iteration 5595/ 11920 | consumed samples: 5729280 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915901E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:51.287739 | finish at 2025-09-10 11:47:21 + [2025-09-10 01:54:35] iteration 5596/ 11920 | consumed samples: 5730304 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921524E+00 | loss scale: 1.0 | grad norm: 0.195 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:51.340533 | finish at 2025-09-10 11:47:26 + [2025-09-10 01:54:41] iteration 5597/ 11920 | consumed samples: 5731328 | elapsed time per iteration (ms): 5956.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918209E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:39.787059 | finish at 2025-09-10 12:22:21 + [2025-09-10 01:54:47] iteration 5598/ 11920 | consumed samples: 5732352 | elapsed time per iteration (ms): 5953.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905763E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:27:17.995552 | finish at 2025-09-10 12:22:05 + [2025-09-10 01:54:53] iteration 5599/ 11920 | consumed samples: 5733376 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915854E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:57.630854 | finish at 2025-09-10 11:47:50 + [2025-09-10 01:54:58] iteration 5600/ 11920 | consumed samples: 5734400 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910591E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:55.829659 | finish at 2025-09-10 11:47:54 + [2025-09-10 01:55:04] iteration 5601/ 11920 | consumed samples: 5735424 | elapsed time per iteration (ms): 5621.8 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902639E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:04.018267 | finish at 2025-09-10 11:47:08 + [2025-09-10 01:55:09] iteration 5602/ 11920 | consumed samples: 5736448 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928896E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:08.811244 | finish at 2025-09-10 11:47:18 + [2025-09-10 01:55:15] iteration 5603/ 11920 | consumed samples: 5737472 | elapsed time per iteration (ms): 5932.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915110E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:36.265963 | finish at 2025-09-10 12:19:52 + [2025-09-10 01:55:21] iteration 5604/ 11920 | consumed samples: 5738496 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916680E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:44.480044 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:55:27] iteration 5605/ 11920 | consumed samples: 5739520 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920982E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:17.905265 
| finish at 2025-09-10 11:47:45 + [2025-09-10 01:55:32] iteration 5606/ 11920 | consumed samples: 5740544 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889232E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:39.642704 | finish at 2025-09-10 11:47:12 + [2025-09-10 01:55:38] iteration 5607/ 11920 | consumed samples: 5741568 | elapsed time per iteration (ms): 5968.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908580E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:28:00.851202 | finish at 2025-09-10 12:23:39 + [2025-09-10 01:55:44] iteration 5608/ 11920 | consumed samples: 5742592 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930369E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:39.480034 | finish at 2025-09-10 11:47:23 + [2025-09-10 01:55:50] iteration 5609/ 11920 | consumed samples: 5743616 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913929E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:50:50.390805 | finish at 2025-09-10 11:46:40 + [2025-09-10 01:55:55] iteration 5610/ 11920 | consumed samples: 5744640 | elapsed time per iteration (ms): 5848.4 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.916967E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:15:03.600307 | finish at 2025-09-10 12:10:59 + [2025-09-10 01:56:01] iteration 5611/ 11920 | consumed samples: 5745664 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934289E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:12.311526 | finish at 2025-09-10 11:47:13 + [2025-09-10 01:56:07] iteration 5612/ 11920 | consumed samples: 5746688 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928826E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:59.462441 | finish at 2025-09-10 11:48:06 + [2025-09-10 01:56:12] iteration 5613/ 11920 | consumed samples: 5747712 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905653E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:50:51.979644 | finish at 2025-09-10 11:47:04 + [2025-09-10 01:56:18] iteration 5614/ 11920 | consumed samples: 5748736 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917164E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:00.940723 | finish at 2025-09-10 11:47:19 + [2025-09-10 01:56:23] iteration 5615/ 11920 | consumed samples: 5749760 
| elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930369E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:50:55.015209 | finish at 2025-09-10 11:47:18 + [2025-09-10 01:56:29] iteration 5616/ 11920 | consumed samples: 5750784 | elapsed time per iteration (ms): 5844.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910986E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:01.060600 | finish at 2025-09-10 12:10:30 + [2025-09-10 01:56:35] iteration 5617/ 11920 | consumed samples: 5751808 | elapsed time per iteration (ms): 5917.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907426E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:40.301831 | finish at 2025-09-10 12:18:16 + [2025-09-10 01:56:41] iteration 5618/ 11920 | consumed samples: 5752832 | elapsed time per iteration (ms): 5948.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909086E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:44.014741 | finish at 2025-09-10 12:21:25 + [2025-09-10 01:56:47] iteration 5619/ 11920 | consumed samples: 5753856 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906407E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 1.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | remaining time: 9:50:45.112510 | finish at 2025-09-10 11:47:32 + [2025-09-10 01:56:52] iteration 5620/ 11920 | consumed samples: 5754880 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907125E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:49.359713 | finish at 2025-09-10 11:46:42 + [2025-09-10 01:56:58] iteration 5621/ 11920 | consumed samples: 5755904 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904572E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:50:31.377905 | finish at 2025-09-10 11:47:29 + [2025-09-10 01:57:04] iteration 5622/ 11920 | consumed samples: 5756928 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910794E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:50:12.555771 | finish at 2025-09-10 11:47:16 + [2025-09-10 01:57:10] iteration 5623/ 11920 | consumed samples: 5757952 | elapsed time per iteration (ms): 5953.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922744E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:24:50.826145 | finish at 2025-09-10 12:22:00 + [2025-09-10 01:57:15] iteration 5624/ 11920 | consumed samples: 5758976 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | 
learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913122E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:54.865969 | finish at 2025-09-10 11:47:10 + [2025-09-10 01:57:21] iteration 5625/ 11920 | consumed samples: 5760000 | elapsed time per iteration (ms): 5835.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899654E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:14.951282 | finish at 2025-09-10 12:09:36 + [2025-09-10 01:57:27] iteration 5626/ 11920 | consumed samples: 5761024 | elapsed time per iteration (ms): 5885.7 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891362E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:24.679755 | finish at 2025-09-10 12:14:52 + [2025-09-10 01:57:33] iteration 5627/ 11920 | consumed samples: 5762048 | elapsed time per iteration (ms): 6213.9 | throughput per GPU (TFLOP/s/GPU): 72.7 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889596E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:43.840705 | finish at 2025-09-10 12:49:17 + [2025-09-10 01:57:39] iteration 5628/ 11920 | consumed samples: 5763072 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910272E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:53.815614 | finish at 2025-09-10 11:47:33 + 
[2025-09-10 01:57:45] iteration 5629/ 11920 | consumed samples: 5764096 | elapsed time per iteration (ms): 5926.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906324E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:21:25.014414 | finish at 2025-09-10 12:19:10 + [2025-09-10 01:57:51] iteration 5630/ 11920 | consumed samples: 5765120 | elapsed time per iteration (ms): 5884.0 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892428E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:50.537355 | finish at 2025-09-10 12:14:41 + [2025-09-10 01:57:56] iteration 5631/ 11920 | consumed samples: 5766144 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899168E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:51.299879 | finish at 2025-09-10 11:47:48 + [2025-09-10 01:58:02] iteration 5632/ 11920 | consumed samples: 5767168 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903144E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:07.127071 | finish at 2025-09-10 11:47:09 + [2025-09-10 01:58:08] iteration 5633/ 11920 | consumed samples: 5768192 | elapsed time per iteration (ms): 6099.7 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916801E+00 | loss scale: 1.0 | 
grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:39:08.980119 | finish at 2025-09-10 12:37:17 + [2025-09-10 01:58:14] iteration 5634/ 11920 | consumed samples: 5769216 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907826E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:57.224184 | finish at 2025-09-10 11:47:11 + [2025-09-10 01:58:19] iteration 5635/ 11920 | consumed samples: 5770240 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900048E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:11.034647 | finish at 2025-09-10 11:47:30 + [2025-09-10 01:58:25] iteration 5636/ 11920 | consumed samples: 5771264 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909284E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:44.788447 | finish at 2025-09-10 11:47:10 + [2025-09-10 01:58:31] iteration 5637/ 11920 | consumed samples: 5772288 | elapsed time per iteration (ms): 5856.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913572E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:13:14.973342 | finish at 2025-09-10 12:11:46 + [2025-09-10 01:58:36] iteration 5638/ 11920 | consumed samples: 5773312 | elapsed time per iteration 
(ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909831E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:08.600410 | finish at 2025-09-10 11:47:45 + [2025-09-10 01:58:42] iteration 5639/ 11920 | consumed samples: 5774336 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906098E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:40.056212 | finish at 2025-09-10 11:48:22 + [2025-09-10 01:58:48] iteration 5640/ 11920 | consumed samples: 5775360 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908807E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:55.624619 | finish at 2025-09-10 11:47:43 + [2025-09-10 01:58:53] iteration 5641/ 11920 | consumed samples: 5776384 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908795E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:44.500832 | finish at 2025-09-10 11:47:38 + [2025-09-10 01:58:59] iteration 5642/ 11920 | consumed samples: 5777408 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899487E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 9:48:26.272028 | finish at 2025-09-10 11:47:25 + [2025-09-10 01:59:05] iteration 5643/ 11920 | consumed samples: 5778432 | elapsed time per iteration (ms): 6224.9 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918592E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:51:13.659932 | finish at 2025-09-10 12:50:19 + [2025-09-10 01:59:11] iteration 5644/ 11920 | consumed samples: 5779456 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902542E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:46.729094 | finish at 2025-09-10 11:46:57 + [2025-09-10 01:59:17] iteration 5645/ 11920 | consumed samples: 5780480 | elapsed time per iteration (ms): 5845.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905926E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:21.972623 | finish at 2025-09-10 12:10:39 + [2025-09-10 01:59:22] iteration 5646/ 11920 | consumed samples: 5781504 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913546E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:44.290509 | finish at 2025-09-10 11:47:06 + [2025-09-10 01:59:28] iteration 5647/ 11920 | consumed samples: 5782528 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.915061E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:28.963834 | finish at 2025-09-10 11:47:57 + [2025-09-10 01:59:33] iteration 5648/ 11920 | consumed samples: 5783552 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906607E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:13.666107 | finish at 2025-09-10 11:47:47 + [2025-09-10 01:59:39] iteration 5649/ 11920 | consumed samples: 5784576 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903690E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:04.842355 | finish at 2025-09-10 11:47:44 + [2025-09-10 01:59:45] iteration 5650/ 11920 | consumed samples: 5785600 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898729E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:05.334070 | finish at 2025-09-10 11:46:50 + [2025-09-10 01:59:51] iteration 5651/ 11920 | consumed samples: 5786624 | elapsed time per iteration (ms): 5861.2 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897036E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:24.153352 | finish at 2025-09-10 12:12:15 + [2025-09-10 01:59:56] iteration 5652/ 
11920 | consumed samples: 5787648 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915174E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:54.878467 | finish at 2025-09-10 11:47:51 + [2025-09-10 02:00:02] iteration 5653/ 11920 | consumed samples: 5788672 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890510E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:27.232616 | finish at 2025-09-10 11:47:29 + [2025-09-10 02:00:07] iteration 5654/ 11920 | consumed samples: 5789696 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913900E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:00.739637 | finish at 2025-09-10 11:47:08 + [2025-09-10 02:00:13] iteration 5655/ 11920 | consumed samples: 5790720 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922843E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:48.028151 | finish at 2025-09-10 11:47:01 + [2025-09-10 02:00:19] iteration 5656/ 11920 | consumed samples: 5791744 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898693E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 9.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:57.635611 | finish at 2025-09-10 11:47:16 + [2025-09-10 02:00:24] iteration 5657/ 11920 | consumed samples: 5792768 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910277E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:23.030069 | finish at 2025-09-10 11:46:47 + [2025-09-10 02:00:30] iteration 5658/ 11920 | consumed samples: 5793792 | elapsed time per iteration (ms): 5851.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910856E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:44.275948 | finish at 2025-09-10 12:11:14 + [2025-09-10 02:00:36] iteration 5659/ 11920 | consumed samples: 5794816 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893465E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:58.053349 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:00:41] iteration 5660/ 11920 | consumed samples: 5795840 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899588E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:03.293762 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:00:47] iteration 5661/ 11920 | consumed samples: 5796864 | elapsed time per iteration (ms): 5622.9 | throughput per GPU 
(TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894482E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:33.711758 | finish at 2025-09-10 11:47:21 + [2025-09-10 02:00:53] iteration 5662/ 11920 | consumed samples: 5797888 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909901E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:21.628399 | finish at 2025-09-10 11:47:14 + [2025-09-10 02:00:58] iteration 5663/ 11920 | consumed samples: 5798912 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902602E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:18.839029 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:01:04] iteration 5664/ 11920 | consumed samples: 5799936 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914247E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:33.928734 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:01:09] iteration 5665/ 11920 | consumed samples: 5800960 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911178E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:52.625048 | finish 
at 2025-09-10 11:47:02 + [2025-09-10 02:01:15] iteration 5666/ 11920 | consumed samples: 5801984 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913164E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:59.227423 | finish at 2025-09-10 11:47:14 + [2025-09-10 02:01:21] iteration 5667/ 11920 | consumed samples: 5803008 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905727E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:17.871807 | finish at 2025-09-10 11:47:39 + [2025-09-10 02:01:27] iteration 5668/ 11920 | consumed samples: 5804032 | elapsed time per iteration (ms): 5981.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916624E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:23:18.822258 | finish at 2025-09-10 12:24:46 + [2025-09-10 02:01:32] iteration 5669/ 11920 | consumed samples: 5805056 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911667E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:54.991236 | finish at 2025-09-10 11:48:27 + [2025-09-10 02:01:38] iteration 5670/ 11920 | consumed samples: 5806080 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.890367E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:16.910460 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:01:44] iteration 5671/ 11920 | consumed samples: 5807104 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907258E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:52.243007 | finish at 2025-09-10 11:47:36 + [2025-09-10 02:01:49] iteration 5672/ 11920 | consumed samples: 5808128 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905324E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:14.291086 | finish at 2025-09-10 11:47:04 + [2025-09-10 02:01:55] iteration 5673/ 11920 | consumed samples: 5809152 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909000E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:03.206903 | finish at 2025-09-10 11:47:58 + [2025-09-10 02:02:00] iteration 5674/ 11920 | consumed samples: 5810176 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904597E+00 | loss scale: 1.0 | grad norm: 0.300 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:56.426966 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:02:06] iteration 5675/ 11920 | consumed samples: 5811200 | 
elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908450E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:00.094516 | finish at 2025-09-10 11:47:06 + [2025-09-10 02:02:12] iteration 5676/ 11920 | consumed samples: 5812224 | elapsed time per iteration (ms): 5948.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911533E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 12.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:03.648255 | finish at 2025-09-10 12:21:16 + [2025-09-10 02:02:18] iteration 5677/ 11920 | consumed samples: 5813248 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908456E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:13.245371 | finish at 2025-09-10 11:46:31 + [2025-09-10 02:02:23] iteration 5678/ 11920 | consumed samples: 5814272 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915120E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:28.623236 | finish at 2025-09-10 11:46:52 + [2025-09-10 02:02:29] iteration 5679/ 11920 | consumed samples: 5815296 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922891E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:45:49.045463 | finish at 2025-09-10 11:48:18 + [2025-09-10 02:02:35] iteration 5680/ 11920 | consumed samples: 5816320 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903947E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:33.740273 | finish at 2025-09-10 11:48:08 + [2025-09-10 02:02:40] iteration 5681/ 11920 | consumed samples: 5817344 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904579E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:51.149652 | finish at 2025-09-10 11:48:31 + [2025-09-10 02:02:46] iteration 5682/ 11920 | consumed samples: 5818368 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914390E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:52.340234 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:02:51] iteration 5683/ 11920 | consumed samples: 5819392 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915889E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:13.427790 | finish at 2025-09-10 11:47:05 + [2025-09-10 02:02:57] iteration 5684/ 11920 | consumed samples: 5820416 | elapsed time per iteration (ms): 5985.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.902045E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:22:04.499206 | finish at 2025-09-10 12:25:02 + [2025-09-10 02:03:03] iteration 5685/ 11920 | consumed samples: 5821440 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918679E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:45.396845 | finish at 2025-09-10 11:46:48 + [2025-09-10 02:03:09] iteration 5686/ 11920 | consumed samples: 5822464 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907258E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:42.129644 | finish at 2025-09-10 11:47:51 + [2025-09-10 02:03:14] iteration 5687/ 11920 | consumed samples: 5823488 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907335E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:35.674360 | finish at 2025-09-10 11:47:50 + [2025-09-10 02:03:20] iteration 5688/ 11920 | consumed samples: 5824512 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914062E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:40.134777 | finish at 2025-09-10 11:49:00 + [2025-09-10 02:03:26] 
iteration 5689/ 11920 | consumed samples: 5825536 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907757E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:54.181910 | finish at 2025-09-10 11:47:20 + [2025-09-10 02:03:31] iteration 5690/ 11920 | consumed samples: 5826560 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914001E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:51.354773 | finish at 2025-09-10 11:47:23 + [2025-09-10 02:03:37] iteration 5691/ 11920 | consumed samples: 5827584 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902160E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:05.246099 | finish at 2025-09-10 11:47:42 + [2025-09-10 02:03:42] iteration 5692/ 11920 | consumed samples: 5828608 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906646E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:00.819732 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:03:48] iteration 5693/ 11920 | consumed samples: 5829632 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909611E+00 | loss scale: 1.0 | grad norm: 0.179 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:30.080837 | finish at 2025-09-10 11:47:18 + [2025-09-10 02:03:54] iteration 5694/ 11920 | consumed samples: 5830656 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919981E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:16.434593 | finish at 2025-09-10 11:48:10 + [2025-09-10 02:03:59] iteration 5695/ 11920 | consumed samples: 5831680 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908800E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:12.399409 | finish at 2025-09-10 11:48:12 + [2025-09-10 02:04:05] iteration 5696/ 11920 | consumed samples: 5832704 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897164E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:57.062973 | finish at 2025-09-10 11:47:02 + [2025-09-10 02:04:11] iteration 5697/ 11920 | consumed samples: 5833728 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899612E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:01.299343 | finish at 2025-09-10 11:47:12 + [2025-09-10 02:04:16] iteration 5698/ 11920 | consumed samples: 5834752 | elapsed time per iteration (ms): 5620.8 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915430E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:52.786826 | finish at 2025-09-10 11:47:09 + [2025-09-10 02:04:22] iteration 5699/ 11920 | consumed samples: 5835776 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899147E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:53.921984 | finish at 2025-09-10 11:47:16 + [2025-09-10 02:04:27] iteration 5700/ 11920 | consumed samples: 5836800 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901518E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:03.283935 | finish at 2025-09-10 11:47:31 + [2025-09-10 02:04:33] iteration 5701/ 11920 | consumed samples: 5837824 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898324E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:56.178370 | finish at 2025-09-10 11:47:29 + [2025-09-10 02:04:39] iteration 5702/ 11920 | consumed samples: 5838848 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894711E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:33.300308 
| finish at 2025-09-10 11:48:12 + [2025-09-10 02:04:44] iteration 5703/ 11920 | consumed samples: 5839872 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892144E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:56.422073 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:04:50] iteration 5704/ 11920 | consumed samples: 5840896 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902347E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:55.258472 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:04:56] iteration 5705/ 11920 | consumed samples: 5841920 | elapsed time per iteration (ms): 5879.8 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915237E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:02.836000 | finish at 2025-09-10 12:13:59 + [2025-09-10 02:05:01] iteration 5706/ 11920 | consumed samples: 5842944 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907245E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:33.012197 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:05:07] iteration 5707/ 11920 | consumed samples: 5843968 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.908080E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:04.953108 | finish at 2025-09-10 11:47:12 + [2025-09-10 02:05:13] iteration 5708/ 11920 | consumed samples: 5844992 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904128E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:16.161079 | finish at 2025-09-10 11:47:29 + [2025-09-10 02:05:18] iteration 5709/ 11920 | consumed samples: 5846016 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900997E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:18.188485 | finish at 2025-09-10 11:47:36 + [2025-09-10 02:05:24] iteration 5710/ 11920 | consumed samples: 5847040 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899531E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:53.789527 | finish at 2025-09-10 11:47:18 + [2025-09-10 02:05:30] iteration 5711/ 11920 | consumed samples: 5848064 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902818E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:14.073306 | finish at 2025-09-10 11:47:44 + [2025-09-10 02:05:35] iteration 5712/ 11920 | consumed samples: 5849088 | 
elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906289E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:14.821747 | finish at 2025-09-10 11:47:50 + [2025-09-10 02:05:41] iteration 5713/ 11920 | consumed samples: 5850112 | elapsed time per iteration (ms): 5936.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898465E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:04.495178 | finish at 2025-09-10 12:19:46 + [2025-09-10 02:05:47] iteration 5714/ 11920 | consumed samples: 5851136 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901206E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:52.966933 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:05:52] iteration 5715/ 11920 | consumed samples: 5852160 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908578E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:35.124474 | finish at 2025-09-10 11:47:27 + [2025-09-10 02:05:58] iteration 5716/ 11920 | consumed samples: 5853184 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901263E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:42:21.522429 | finish at 2025-09-10 11:48:20 + [2025-09-10 02:06:04] iteration 5717/ 11920 | consumed samples: 5854208 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898456E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:36.307295 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:06:09] iteration 5718/ 11920 | consumed samples: 5855232 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899200E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:11.948284 | finish at 2025-09-10 11:47:21 + [2025-09-10 02:06:15] iteration 5719/ 11920 | consumed samples: 5856256 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918294E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:05.153192 | finish at 2025-09-10 11:47:20 + [2025-09-10 02:06:20] iteration 5720/ 11920 | consumed samples: 5857280 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906797E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:51.097584 | finish at 2025-09-10 11:47:12 + [2025-09-10 02:06:26] iteration 5721/ 11920 | consumed samples: 5858304 | elapsed time per iteration (ms): 5959.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.910720E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:15:43.684373 | finish at 2025-09-10 12:22:10 + [2025-09-10 02:06:32] iteration 5722/ 11920 | consumed samples: 5859328 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894178E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:25.011412 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:06:38] iteration 5723/ 11920 | consumed samples: 5860352 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911242E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:22.038013 | finish at 2025-09-10 11:48:00 + [2025-09-10 02:06:43] iteration 5724/ 11920 | consumed samples: 5861376 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893297E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:14.893863 | finish at 2025-09-10 11:46:58 + [2025-09-10 02:06:49] iteration 5725/ 11920 | consumed samples: 5862400 | elapsed time per iteration (ms): 5949.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906023E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:16.095486 | finish at 2025-09-10 12:21:05 + [2025-09-10 02:06:55] 
iteration 5726/ 11920 | consumed samples: 5863424 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898231E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:34.425343 | finish at 2025-09-10 11:48:29 + [2025-09-10 02:07:01] iteration 5727/ 11920 | consumed samples: 5864448 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889664E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:27.829062 | finish at 2025-09-10 11:48:28 + [2025-09-10 02:07:06] iteration 5728/ 11920 | consumed samples: 5865472 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910831E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:04.073433 | finish at 2025-09-10 11:47:10 + [2025-09-10 02:07:12] iteration 5729/ 11920 | consumed samples: 5866496 | elapsed time per iteration (ms): 5906.2 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904728E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:09:25.422528 | finish at 2025-09-10 12:16:37 + [2025-09-10 02:07:18] iteration 5730/ 11920 | consumed samples: 5867520 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895463E+00 | loss scale: 1.0 | grad norm: 0.143 | num 
zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:39.014361 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:07:23] iteration 5731/ 11920 | consumed samples: 5868544 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897820E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:03.812661 | finish at 2025-09-10 11:47:27 + [2025-09-10 02:07:29] iteration 5732/ 11920 | consumed samples: 5869568 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920536E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:00.744445 | finish at 2025-09-10 11:47:30 + [2025-09-10 02:07:35] iteration 5733/ 11920 | consumed samples: 5870592 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910947E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:39:42.293104 | finish at 2025-09-10 11:47:17 + [2025-09-10 02:07:40] iteration 5734/ 11920 | consumed samples: 5871616 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890143E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:32.522641 | finish at 2025-09-10 11:48:13 + [2025-09-10 02:07:46] iteration 5735/ 11920 | consumed samples: 5872640 | elapsed time per iteration (ms): 5625.2 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909185E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:39:51.751609 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:07:51] iteration 5736/ 11920 | consumed samples: 5873664 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905662E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:39:09.120951 | finish at 2025-09-10 11:47:01 + [2025-09-10 02:07:57] iteration 5737/ 11920 | consumed samples: 5874688 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915958E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:40:30.355255 | finish at 2025-09-10 11:48:27 + [2025-09-10 02:08:03] iteration 5738/ 11920 | consumed samples: 5875712 | elapsed time per iteration (ms): 6189.5 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905612E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:43.308288 | finish at 2025-09-10 12:45:47 + [2025-09-10 02:08:09] iteration 5739/ 11920 | consumed samples: 5876736 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897705E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:51.040214 
| finish at 2025-09-10 11:47:00 + [2025-09-10 02:08:14] iteration 5740/ 11920 | consumed samples: 5877760 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909609E+00 | loss scale: 1.0 | grad norm: 0.308 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:42.444892 | finish at 2025-09-10 11:46:57 + [2025-09-10 02:08:20] iteration 5741/ 11920 | consumed samples: 5878784 | elapsed time per iteration (ms): 5838.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900997E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:18.752220 | finish at 2025-09-10 12:09:39 + [2025-09-10 02:08:26] iteration 5742/ 11920 | consumed samples: 5879808 | elapsed time per iteration (ms): 5951.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912305E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:50.914826 | finish at 2025-09-10 12:21:17 + [2025-09-10 02:08:33] iteration 5743/ 11920 | consumed samples: 5880832 | elapsed time per iteration (ms): 6367.4 | throughput per GPU (TFLOP/s/GPU): 70.9 | MFU 7.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899247E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:55:31.485337 | finish at 2025-09-10 13:04:04 + [2025-09-10 02:08:39] iteration 5744/ 11920 | consumed samples: 5881856 | elapsed time per iteration (ms): 5989.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.899835E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:16:33.788010 | finish at 2025-09-10 12:25:12 + [2025-09-10 02:08:44] iteration 5745/ 11920 | consumed samples: 5882880 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900712E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:39:06.397269 | finish at 2025-09-10 11:47:51 + [2025-09-10 02:08:51] iteration 5746/ 11920 | consumed samples: 5883904 | elapsed time per iteration (ms): 6258.6 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900187E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:44:00.831898 | finish at 2025-09-10 12:52:51 + [2025-09-10 02:08:56] iteration 5747/ 11920 | consumed samples: 5884928 | elapsed time per iteration (ms): 5889.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898490E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:53.638469 | finish at 2025-09-10 12:14:50 + [2025-09-10 02:09:02] iteration 5748/ 11920 | consumed samples: 5885952 | elapsed time per iteration (ms): 5932.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901968E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:13.947205 | finish at 2025-09-10 12:19:16 + [2025-09-10 02:09:08] iteration 5749/ 11920 | consumed samples: 
5886976 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891526E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:16.531010 | finish at 2025-09-10 11:47:25 + [2025-09-10 02:09:14] iteration 5750/ 11920 | consumed samples: 5888000 | elapsed time per iteration (ms): 5879.9 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901560E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:38.823996 | finish at 2025-09-10 12:13:53 + [2025-09-10 02:09:19] iteration 5751/ 11920 | consumed samples: 5889024 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907588E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:43.615154 | finish at 2025-09-10 11:47:03 + [2025-09-10 02:09:25] iteration 5752/ 11920 | consumed samples: 5890048 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897321E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:04.866331 | finish at 2025-09-10 11:47:30 + [2025-09-10 02:09:31] iteration 5753/ 11920 | consumed samples: 5891072 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888626E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | remaining time: 9:37:39.105370 | finish at 2025-09-10 11:47:10 + [2025-09-10 02:09:37] iteration 5754/ 11920 | consumed samples: 5892096 | elapsed time per iteration (ms): 6203.6 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895021E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:37:31.288381 | finish at 2025-09-10 12:47:08 + [2025-09-10 02:09:43] iteration 5755/ 11920 | consumed samples: 5893120 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906994E+00 | loss scale: 1.0 | grad norm: 0.293 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:34.363396 | finish at 2025-09-10 11:47:17 + [2025-09-10 02:09:48] iteration 5756/ 11920 | consumed samples: 5894144 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895315E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:30.067841 | finish at 2025-09-10 11:47:18 + [2025-09-10 02:09:54] iteration 5757/ 11920 | consumed samples: 5895168 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895543E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:11.410196 | finish at 2025-09-10 11:47:05 + [2025-09-10 02:10:00] iteration 5758/ 11920 | consumed samples: 5896192 | elapsed time per iteration (ms): 5856.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | 
learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912896E+00 | loss scale: 1.0 | grad norm: 0.308 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:28.100633 | finish at 2025-09-10 12:11:28 + [2025-09-10 02:10:06] iteration 5759/ 11920 | consumed samples: 5897216 | elapsed time per iteration (ms): 5894.5 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899760E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:16.196749 | finish at 2025-09-10 12:15:22 + [2025-09-10 02:10:11] iteration 5760/ 11920 | consumed samples: 5898240 | elapsed time per iteration (ms): 5959.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909547E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:49.675980 | finish at 2025-09-10 12:22:01 + [2025-09-10 02:10:17] iteration 5761/ 11920 | consumed samples: 5899264 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903893E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:17.237010 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:10:23] iteration 5762/ 11920 | consumed samples: 5900288 | elapsed time per iteration (ms): 5828.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899645E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:08.787718 | finish at 2025-09-10 12:08:32 + 
[2025-09-10 02:10:29] iteration 5763/ 11920 | consumed samples: 5901312 | elapsed time per iteration (ms): 5985.0 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897279E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:14:09.440310 | finish at 2025-09-10 12:24:38 + [2025-09-10 02:10:35] iteration 5764/ 11920 | consumed samples: 5902336 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908261E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:22.392800 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:10:40] iteration 5765/ 11920 | consumed samples: 5903360 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911789E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:13.808436 | finish at 2025-09-10 11:47:54 + [2025-09-10 02:10:46] iteration 5766/ 11920 | consumed samples: 5904384 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894917E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:05.967451 | finish at 2025-09-10 11:47:52 + [2025-09-10 02:10:51] iteration 5767/ 11920 | consumed samples: 5905408 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905208E+00 | loss scale: 1.0 | 
grad norm: 0.195 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:55.892959 | finish at 2025-09-10 11:47:47 + [2025-09-10 02:10:57] iteration 5768/ 11920 | consumed samples: 5906432 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898267E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:50.986586 | finish at 2025-09-10 11:49:48 + [2025-09-10 02:11:03] iteration 5769/ 11920 | consumed samples: 5907456 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901572E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:10.742240 | finish at 2025-09-10 11:48:13 + [2025-09-10 02:11:08] iteration 5770/ 11920 | consumed samples: 5908480 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897431E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:35.229471 | finish at 2025-09-10 11:47:44 + [2025-09-10 02:11:14] iteration 5771/ 11920 | consumed samples: 5909504 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907471E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:51.431589 | finish at 2025-09-10 11:47:05 + [2025-09-10 02:11:20] iteration 5772/ 11920 | consumed samples: 5910528 | elapsed time per iteration (ms): 
5975.3 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892908E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:12:16.313419 | finish at 2025-09-10 12:23:36 + [2025-09-10 02:11:26] iteration 5773/ 11920 | consumed samples: 5911552 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901618E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:58.612669 | finish at 2025-09-10 11:47:24 + [2025-09-10 02:11:31] iteration 5774/ 11920 | consumed samples: 5912576 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906963E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:07.862179 | finish at 2025-09-10 11:47:39 + [2025-09-10 02:11:37] iteration 5775/ 11920 | consumed samples: 5913600 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902420E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:26.396935 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:11:43] iteration 5776/ 11920 | consumed samples: 5914624 | elapsed time per iteration (ms): 5852.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895676E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 9:59:18.777832 | finish at 2025-09-10 12:11:01 + [2025-09-10 02:11:48] iteration 5777/ 11920 | consumed samples: 5915648 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894118E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:51.404780 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:11:54] iteration 5778/ 11920 | consumed samples: 5916672 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881087E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:37.612995 | finish at 2025-09-10 11:48:32 + [2025-09-10 02:12:00] iteration 5779/ 11920 | consumed samples: 5917696 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901217E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:17.719428 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:12:05] iteration 5780/ 11920 | consumed samples: 5918720 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897451E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:30.059047 | finish at 2025-09-10 11:47:35 + [2025-09-10 02:12:11] iteration 5781/ 11920 | consumed samples: 5919744 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 2.901448E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:01.281753 | finish at 2025-09-10 11:47:12 + [2025-09-10 02:12:17] iteration 5782/ 11920 | consumed samples: 5920768 | elapsed time per iteration (ms): 5966.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898763E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:22.249952 | finish at 2025-09-10 12:22:39 + [2025-09-10 02:12:22] iteration 5783/ 11920 | consumed samples: 5921792 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903420E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:52.228113 | finish at 2025-09-10 11:48:15 + [2025-09-10 02:12:28] iteration 5784/ 11920 | consumed samples: 5922816 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893812E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:54.546686 | finish at 2025-09-10 11:47:23 + [2025-09-10 02:12:34] iteration 5785/ 11920 | consumed samples: 5923840 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907415E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:26.086324 | finish at 2025-09-10 11:48:00 + [2025-09-10 02:12:39] iteration 5786/ 11920 | 
consumed samples: 5924864 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912093E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:16.362251 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:12:45] iteration 5787/ 11920 | consumed samples: 5925888 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920007E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:11.787994 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:12:51] iteration 5788/ 11920 | consumed samples: 5926912 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901753E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:04.621299 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:12:56] iteration 5789/ 11920 | consumed samples: 5927936 | elapsed time per iteration (ms): 5926.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905327E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:34.297621 | finish at 2025-09-10 12:18:31 + [2025-09-10 02:13:02] iteration 5790/ 11920 | consumed samples: 5928960 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900369E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:54.127328 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:13:08] iteration 5791/ 11920 | consumed samples: 5929984 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908774E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:57.441723 | finish at 2025-09-10 11:48:05 + [2025-09-10 02:13:13] iteration 5792/ 11920 | consumed samples: 5931008 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891704E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:11.593956 | finish at 2025-09-10 11:47:25 + [2025-09-10 02:13:19] iteration 5793/ 11920 | consumed samples: 5932032 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895688E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:47.203722 | finish at 2025-09-10 11:47:06 + [2025-09-10 02:13:25] iteration 5794/ 11920 | consumed samples: 5933056 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878122E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:43.878801 | finish at 2025-09-10 11:48:08 + [2025-09-10 02:13:30] iteration 5795/ 11920 | consumed samples: 5934080 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898116E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:27.796773 | finish at 2025-09-10 11:47:58 + [2025-09-10 02:13:36] iteration 5796/ 11920 | consumed samples: 5935104 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878964E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:05.875274 | finish at 2025-09-10 11:48:42 + [2025-09-10 02:13:41] iteration 5797/ 11920 | consumed samples: 5936128 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906452E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:46.164235 | finish at 2025-09-10 11:47:28 + [2025-09-10 02:13:47] iteration 5798/ 11920 | consumed samples: 5937152 | elapsed time per iteration (ms): 5615.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895294E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:58.903833 | finish at 2025-09-10 11:46:46 + [2025-09-10 02:13:53] iteration 5799/ 11920 | consumed samples: 5938176 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891479E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:34.459668 | finish at 2025-09-10 
11:47:27 + [2025-09-10 02:13:58] iteration 5800/ 11920 | consumed samples: 5939200 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905991E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:56.740637 | finish at 2025-09-10 11:48:55 + [2025-09-10 02:14:04] iteration 5801/ 11920 | consumed samples: 5940224 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909767E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:31.615200 | finish at 2025-09-10 11:47:36 + [2025-09-10 02:14:10] iteration 5802/ 11920 | consumed samples: 5941248 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893317E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:23.943531 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:14:15] iteration 5803/ 11920 | consumed samples: 5942272 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892120E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:22.925781 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:14:21] iteration 5804/ 11920 | consumed samples: 5943296 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905571E+00 | loss scale: 
1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:41.500690 | finish at 2025-09-10 11:47:02 + [2025-09-10 02:14:26] iteration 5805/ 11920 | consumed samples: 5944320 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907941E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:35.519004 | finish at 2025-09-10 11:48:02 + [2025-09-10 02:14:32] iteration 5806/ 11920 | consumed samples: 5945344 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893571E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:47.059642 | finish at 2025-09-10 11:48:19 + [2025-09-10 02:14:38] iteration 5807/ 11920 | consumed samples: 5946368 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882217E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:11.359035 | finish at 2025-09-10 11:48:49 + [2025-09-10 02:14:43] iteration 5808/ 11920 | consumed samples: 5947392 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893783E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:59.284508 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:14:49] iteration 5809/ 11920 | consumed samples: 5948416 | elapsed time per iteration 
(ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898780E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:44.616175 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:14:55] iteration 5810/ 11920 | consumed samples: 5949440 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890134E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:33.164365 | finish at 2025-09-10 11:47:28 + [2025-09-10 02:15:00] iteration 5811/ 11920 | consumed samples: 5950464 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905543E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:26.498087 | finish at 2025-09-10 11:48:27 + [2025-09-10 02:15:06] iteration 5812/ 11920 | consumed samples: 5951488 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902595E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:28.168282 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:15:11] iteration 5813/ 11920 | consumed samples: 5952512 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901876E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 9:32:04.899276 | finish at 2025-09-10 11:47:16 + [2025-09-10 02:15:17] iteration 5814/ 11920 | consumed samples: 5953536 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892914E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:07.104987 | finish at 2025-09-10 11:47:24 + [2025-09-10 02:15:23] iteration 5815/ 11920 | consumed samples: 5954560 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909952E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:55.024867 | finish at 2025-09-10 11:47:18 + [2025-09-10 02:15:28] iteration 5816/ 11920 | consumed samples: 5955584 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906876E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:50.697226 | finish at 2025-09-10 11:48:19 + [2025-09-10 02:15:34] iteration 5817/ 11920 | consumed samples: 5956608 | elapsed time per iteration (ms): 5867.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902261E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:46.020800 | finish at 2025-09-10 12:12:20 + [2025-09-10 02:15:40] iteration 5818/ 11920 | consumed samples: 5957632 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.883461E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:33:01.169237 | finish at 2025-09-10 11:48:41 + [2025-09-10 02:15:45] iteration 5819/ 11920 | consumed samples: 5958656 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888659E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:54.690709 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:15:51] iteration 5820/ 11920 | consumed samples: 5959680 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893457E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:52.662888 | finish at 2025-09-10 11:47:44 + [2025-09-10 02:15:57] iteration 5821/ 11920 | consumed samples: 5960704 | elapsed time per iteration (ms): 5962.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901804E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:05.045244 | finish at 2025-09-10 12:22:02 + [2025-09-10 02:16:03] iteration 5822/ 11920 | consumed samples: 5961728 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891664E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:41.628008 | finish at 2025-09-10 11:47:44 + [2025-09-10 02:16:08] iteration 5823/ 11920 
| consumed samples: 5962752 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891107E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:37.905758 | finish at 2025-09-10 11:47:46 + [2025-09-10 02:16:14] iteration 5824/ 11920 | consumed samples: 5963776 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892119E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:08.507126 | finish at 2025-09-10 11:47:22 + [2025-09-10 02:16:20] iteration 5825/ 11920 | consumed samples: 5964800 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904719E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:29.154447 | finish at 2025-09-10 11:47:49 + [2025-09-10 02:16:25] iteration 5826/ 11920 | consumed samples: 5965824 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906622E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:31.080956 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:16:31] iteration 5827/ 11920 | consumed samples: 5966848 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895072E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:19.824780 | finish at 2025-09-10 11:47:51 + [2025-09-10 02:16:36] iteration 5828/ 11920 | consumed samples: 5967872 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904536E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:32:18.857219 | finish at 2025-09-10 11:48:55 + [2025-09-10 02:16:42] iteration 5829/ 11920 | consumed samples: 5968896 | elapsed time per iteration (ms): 5839.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896533E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:47.441550 | finish at 2025-09-10 12:09:30 + [2025-09-10 02:16:48] iteration 5830/ 11920 | consumed samples: 5969920 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884953E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:45.101781 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:16:54] iteration 5831/ 11920 | consumed samples: 5970944 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899524E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:55.836713 | finish at 2025-09-10 11:47:49 + [2025-09-10 02:17:00] iteration 5832/ 11920 | consumed samples: 5971968 | elapsed time per iteration (ms): 6012.8 | throughput per GPU (TFLOP/s/GPU): 
75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893640E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:10:06.100447 | finish at 2025-09-10 12:27:06 + [2025-09-10 02:17:05] iteration 5833/ 11920 | consumed samples: 5972992 | elapsed time per iteration (ms): 5945.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889894E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:03:11.292617 | finish at 2025-09-10 12:20:17 + [2025-09-10 02:17:11] iteration 5834/ 11920 | consumed samples: 5974016 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888172E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:45.529344 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:17:17] iteration 5835/ 11920 | consumed samples: 5975040 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895860E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:00.220754 | finish at 2025-09-10 11:47:17 + [2025-09-10 02:17:23] iteration 5836/ 11920 | consumed samples: 5976064 | elapsed time per iteration (ms): 5969.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902983E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:05:20.530071 | finish at 
2025-09-10 12:22:43 + [2025-09-10 02:17:28] iteration 5837/ 11920 | consumed samples: 5977088 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896268E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:37.505521 | finish at 2025-09-10 11:48:06 + [2025-09-10 02:17:34] iteration 5838/ 11920 | consumed samples: 5978112 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900084E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:14.408235 | finish at 2025-09-10 11:47:48 + [2025-09-10 02:17:40] iteration 5839/ 11920 | consumed samples: 5979136 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897330E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:54.705871 | finish at 2025-09-10 11:48:34 + [2025-09-10 02:17:45] iteration 5840/ 11920 | consumed samples: 5980160 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889915E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:41.437836 | finish at 2025-09-10 11:48:27 + [2025-09-10 02:17:51] iteration 5841/ 11920 | consumed samples: 5981184 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895579E+00 | 
loss scale: 1.0 | grad norm: 0.182 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:55.515636 | finish at 2025-09-10 11:47:46 + [2025-09-10 02:17:56] iteration 5842/ 11920 | consumed samples: 5982208 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902250E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:36.497791 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:18:02] iteration 5843/ 11920 | consumed samples: 5983232 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896809E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:12.745692 | finish at 2025-09-10 11:48:15 + [2025-09-10 02:18:08] iteration 5844/ 11920 | consumed samples: 5984256 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883918E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:56.528631 | finish at 2025-09-10 11:49:04 + [2025-09-10 02:18:13] iteration 5845/ 11920 | consumed samples: 5985280 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898663E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:00.586492 | finish at 2025-09-10 11:48:14 + [2025-09-10 02:18:19] iteration 5846/ 11920 | consumed samples: 5986304 | elapsed time 
per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883469E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:24.232721 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:18:25] iteration 5847/ 11920 | consumed samples: 5987328 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888940E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:50.113065 | finish at 2025-09-10 11:47:15 + [2025-09-10 02:18:30] iteration 5848/ 11920 | consumed samples: 5988352 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890968E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:42.002083 | finish at 2025-09-10 11:48:12 + [2025-09-10 02:18:36] iteration 5849/ 11920 | consumed samples: 5989376 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890488E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:58.687806 | finish at 2025-09-10 11:48:35 + [2025-09-10 02:18:42] iteration 5850/ 11920 | consumed samples: 5990400 | elapsed time per iteration (ms): 5841.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889725E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 9:50:55.827086 | finish at 2025-09-10 12:09:38 + [2025-09-10 02:18:47] iteration 5851/ 11920 | consumed samples: 5991424 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901717E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:07.268355 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:18:53] iteration 5852/ 11920 | consumed samples: 5992448 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915205E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:07.171246 | finish at 2025-09-10 11:47:00 + [2025-09-10 02:18:59] iteration 5853/ 11920 | consumed samples: 5993472 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898368E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:29.738486 | finish at 2025-09-10 11:47:28 + [2025-09-10 02:19:05] iteration 5854/ 11920 | consumed samples: 5994496 | elapsed time per iteration (ms): 6000.2 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879827E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:06:36.984894 | finish at 2025-09-10 12:25:42 + [2025-09-10 02:19:10] iteration 5855/ 11920 | consumed samples: 5995520 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.907067E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:50.446589 | finish at 2025-09-10 11:48:01 + [2025-09-10 02:19:17] iteration 5856/ 11920 | consumed samples: 5996544 | elapsed time per iteration (ms): 6561.1 | throughput per GPU (TFLOP/s/GPU): 68.8 | MFU 6.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904116E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:03:06.588509 | finish at 2025-09-10 13:22:23 + [2025-09-10 02:19:23] iteration 5857/ 11920 | consumed samples: 5997568 | elapsed time per iteration (ms): 5961.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899259E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:02:23.917801 | finish at 2025-09-10 12:21:47 + [2025-09-10 02:19:29] iteration 5858/ 11920 | consumed samples: 5998592 | elapsed time per iteration (ms): 5879.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886467E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:00.730666 | finish at 2025-09-10 12:13:29 + [2025-09-10 02:19:35] iteration 5859/ 11920 | consumed samples: 5999616 | elapsed time per iteration (ms): 5926.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884203E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:43.087750 | finish at 2025-09-10 12:18:18 + [2025-09-10 02:19:40] 
iteration 5860/ 11920 | consumed samples: 6000640 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889959E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:52.862563 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:19:46] iteration 5861/ 11920 | consumed samples: 6001664 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880344E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:14.755749 | finish at 2025-09-10 11:47:01 + [2025-09-10 02:19:51] iteration 5862/ 11920 | consumed samples: 6002688 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882772E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:33.354326 | finish at 2025-09-10 11:47:25 + [2025-09-10 02:19:57] iteration 5863/ 11920 | consumed samples: 6003712 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897773E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:41.998905 | finish at 2025-09-10 11:48:39 + [2025-09-10 02:20:03] iteration 5864/ 11920 | consumed samples: 6004736 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894446E+00 | loss scale: 1.0 | grad norm: 0.177 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:24.638645 | finish at 2025-09-10 11:47:27 + [2025-09-10 02:20:09] iteration 5865/ 11920 | consumed samples: 6005760 | elapsed time per iteration (ms): 5964.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899055E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:01:53.018907 | finish at 2025-09-10 12:22:02 + [2025-09-10 02:20:14] iteration 5866/ 11920 | consumed samples: 6006784 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903902E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:23.271017 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:20:20] iteration 5867/ 11920 | consumed samples: 6007808 | elapsed time per iteration (ms): 5864.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899367E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:34.796826 | finish at 2025-09-10 12:11:55 + [2025-09-10 02:20:26] iteration 5868/ 11920 | consumed samples: 6008832 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896822E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:43.284613 | finish at 2025-09-10 11:47:09 + [2025-09-10 02:20:31] iteration 5869/ 11920 | consumed samples: 6009856 | elapsed time per iteration (ms): 5628.9 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884866E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:40.767071 | finish at 2025-09-10 11:48:12 + [2025-09-10 02:20:37] iteration 5870/ 11920 | consumed samples: 6010880 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892399E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:44.778481 | finish at 2025-09-10 11:47:22 + [2025-09-10 02:20:43] iteration 5871/ 11920 | consumed samples: 6011904 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894725E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:11.510594 | finish at 2025-09-10 11:47:54 + [2025-09-10 02:20:48] iteration 5872/ 11920 | consumed samples: 6012928 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886317E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:28.052032 | finish at 2025-09-10 11:47:16 + [2025-09-10 02:20:54] iteration 5873/ 11920 | consumed samples: 6013952 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896114E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:34.894518 
| finish at 2025-09-10 11:47:29 + [2025-09-10 02:20:59] iteration 5874/ 11920 | consumed samples: 6014976 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891359E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:07.299059 | finish at 2025-09-10 11:49:07 + [2025-09-10 02:21:05] iteration 5875/ 11920 | consumed samples: 6016000 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900601E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:24.076127 | finish at 2025-09-10 11:47:29 + [2025-09-10 02:21:11] iteration 5876/ 11920 | consumed samples: 6017024 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889579E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:20.543731 | finish at 2025-09-10 11:47:31 + [2025-09-10 02:21:16] iteration 5877/ 11920 | consumed samples: 6018048 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872482E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:23.739009 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:21:22] iteration 5878/ 11920 | consumed samples: 6019072 | elapsed time per iteration (ms): 5915.2 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.913830E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:39.586511 | finish at 2025-09-10 12:17:02 + [2025-09-10 02:21:28] iteration 5879/ 11920 | consumed samples: 6020096 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896547E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:25.292969 | finish at 2025-09-10 11:47:53 + [2025-09-10 02:21:34] iteration 5880/ 11920 | consumed samples: 6021120 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898039E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:43.923368 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:21:39] iteration 5881/ 11920 | consumed samples: 6022144 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902951E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:39.940722 | finish at 2025-09-10 11:48:19 + [2025-09-10 02:21:45] iteration 5882/ 11920 | consumed samples: 6023168 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902607E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:31.660410 | finish at 2025-09-10 11:48:16 + [2025-09-10 02:21:51] iteration 5883/ 11920 | consumed samples: 6024192 | 
elapsed time per iteration (ms): 5858.2 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895675E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:49:25.900587 | finish at 2025-09-10 12:11:17 + [2025-09-10 02:21:56] iteration 5884/ 11920 | consumed samples: 6025216 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884709E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:34.941776 | finish at 2025-09-10 11:48:31 + [2025-09-10 02:22:02] iteration 5885/ 11920 | consumed samples: 6026240 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911055E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:27.827723 | finish at 2025-09-10 11:48:30 + [2025-09-10 02:22:08] iteration 5886/ 11920 | consumed samples: 6027264 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900131E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:10.171970 | finish at 2025-09-10 11:48:18 + [2025-09-10 02:22:13] iteration 5887/ 11920 | consumed samples: 6028288 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884001E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:25:56.165056 | finish at 2025-09-10 11:48:09 + [2025-09-10 02:22:19] iteration 5888/ 11920 | consumed samples: 6029312 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890712E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:25:22.091663 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:22:25] iteration 5889/ 11920 | consumed samples: 6030336 | elapsed time per iteration (ms): 5966.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892607E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:59:43.423989 | finish at 2025-09-10 12:22:08 + [2025-09-10 02:22:30] iteration 5890/ 11920 | consumed samples: 6031360 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898600E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:25:37.732916 | finish at 2025-09-10 11:48:08 + [2025-09-10 02:22:36] iteration 5891/ 11920 | consumed samples: 6032384 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882581E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:25:00.709955 | finish at 2025-09-10 11:47:37 + [2025-09-10 02:22:42] iteration 5892/ 11920 | consumed samples: 6033408 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.899958E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:25:48.794700 | finish at 2025-09-10 11:48:30 + [2025-09-10 02:22:48] iteration 5893/ 11920 | consumed samples: 6034432 | elapsed time per iteration (ms): 5938.7 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879372E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:56:32.309004 | finish at 2025-09-10 12:19:20 + [2025-09-10 02:22:53] iteration 5894/ 11920 | consumed samples: 6035456 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885290E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:24:59.245541 | finish at 2025-09-10 11:47:52 + [2025-09-10 02:22:59] iteration 5895/ 11920 | consumed samples: 6036480 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897787E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:28.954377 | finish at 2025-09-10 11:49:28 + [2025-09-10 02:23:04] iteration 5896/ 11920 | consumed samples: 6037504 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888064E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:24:07.725431 | finish at 2025-09-10 11:47:12 + [2025-09-10 02:23:10] 
iteration 5897/ 11920 | consumed samples: 6038528 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885895E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:24:04.075368 | finish at 2025-09-10 11:47:14 + [2025-09-10 02:23:16] iteration 5898/ 11920 | consumed samples: 6039552 | elapsed time per iteration (ms): 5960.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902911E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:58:16.670865 | finish at 2025-09-10 12:21:33 + [2025-09-10 02:23:22] iteration 5899/ 11920 | consumed samples: 6040576 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877754E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:24:33.238316 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:23:27] iteration 5900/ 11920 | consumed samples: 6041600 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893366E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:21.286750 | finish at 2025-09-10 11:46:49 + [2025-09-10 02:23:33] iteration 5901/ 11920 | consumed samples: 6042624 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883296E+00 | loss scale: 1.0 | grad norm: 0.207 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:24:45.539953 | finish at 2025-09-10 11:48:18 + [2025-09-10 02:23:39] iteration 5902/ 11920 | consumed samples: 6043648 | elapsed time per iteration (ms): 6027.6 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887091E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:33.800097 | finish at 2025-09-10 12:28:13 + [2025-09-10 02:23:45] iteration 5903/ 11920 | consumed samples: 6044672 | elapsed time per iteration (ms): 5841.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897171E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:45:47.660210 | finish at 2025-09-10 12:09:32 + [2025-09-10 02:23:51] iteration 5904/ 11920 | consumed samples: 6045696 | elapsed time per iteration (ms): 5954.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887919E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:01.165894 | finish at 2025-09-10 12:20:52 + [2025-09-10 02:23:57] iteration 5905/ 11920 | consumed samples: 6046720 | elapsed time per iteration (ms): 6599.2 | throughput per GPU (TFLOP/s/GPU): 68.4 | MFU 6.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890931E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 11:01:33.944267 | finish at 2025-09-10 13:25:31 + [2025-09-10 02:24:03] iteration 5906/ 11920 | consumed samples: 6047744 | elapsed time per iteration (ms): 5626.3 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898567E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:56.776689 | finish at 2025-09-10 11:48:00 + [2025-09-10 02:24:09] iteration 5907/ 11920 | consumed samples: 6048768 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868730E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:24.571208 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:24:14] iteration 5908/ 11920 | consumed samples: 6049792 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895219E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:00.703895 | finish at 2025-09-10 11:47:15 + [2025-09-10 02:24:20] iteration 5909/ 11920 | consumed samples: 6050816 | elapsed time per iteration (ms): 5931.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892352E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:54:11.209437 | finish at 2025-09-10 12:18:31 + [2025-09-10 02:24:26] iteration 5910/ 11920 | consumed samples: 6051840 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865483E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:00.013680 
| finish at 2025-09-10 11:47:26 + [2025-09-10 02:24:31] iteration 5911/ 11920 | consumed samples: 6052864 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890005E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:43.018865 | finish at 2025-09-10 11:48:14 + [2025-09-10 02:24:37] iteration 5912/ 11920 | consumed samples: 6053888 | elapsed time per iteration (ms): 5971.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879696E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:56.330334 | finish at 2025-09-10 12:22:34 + [2025-09-10 02:24:43] iteration 5913/ 11920 | consumed samples: 6054912 | elapsed time per iteration (ms): 5839.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884747E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:35.030460 | finish at 2025-09-10 12:09:18 + [2025-09-10 02:24:49] iteration 5914/ 11920 | consumed samples: 6055936 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889897E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:26.067015 | finish at 2025-09-10 11:47:15 + [2025-09-10 02:24:54] iteration 5915/ 11920 | consumed samples: 6056960 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.885510E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:40.953147 | finish at 2025-09-10 11:47:35 + [2025-09-10 02:25:00] iteration 5916/ 11920 | consumed samples: 6057984 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875756E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:54.087495 | finish at 2025-09-10 11:47:54 + [2025-09-10 02:25:06] iteration 5917/ 11920 | consumed samples: 6059008 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888144E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:29.518513 | finish at 2025-09-10 11:47:35 + [2025-09-10 02:25:11] iteration 5918/ 11920 | consumed samples: 6060032 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883249E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:03.022578 | finish at 2025-09-10 11:47:14 + [2025-09-10 02:25:17] iteration 5919/ 11920 | consumed samples: 6061056 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904282E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:15.853467 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:25:23] iteration 5920/ 11920 | consumed samples: 6062080 | 
elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893599E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:29.017239 | finish at 2025-09-10 11:47:52 + [2025-09-10 02:25:29] iteration 5921/ 11920 | consumed samples: 6063104 | elapsed time per iteration (ms): 6259.0 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892104E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:25:47.595490 | finish at 2025-09-10 12:51:16 + [2025-09-10 02:25:34] iteration 5922/ 11920 | consumed samples: 6064128 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904500E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:47.443645 | finish at 2025-09-10 11:48:22 + [2025-09-10 02:25:40] iteration 5923/ 11920 | consumed samples: 6065152 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898133E+00 | loss scale: 1.0 | grad norm: 0.311 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:46.573684 | finish at 2025-09-10 11:47:27 + [2025-09-10 02:25:46] iteration 5924/ 11920 | consumed samples: 6066176 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895655E+00 | loss scale: 1.0 | grad norm: 0.323 | num zeros: 1.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:21:47.383263 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:25:51] iteration 5925/ 11920 | consumed samples: 6067200 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900857E+00 | loss scale: 1.0 | grad norm: 0.323 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:37.758094 | finish at 2025-09-10 11:47:29 + [2025-09-10 02:25:57] iteration 5926/ 11920 | consumed samples: 6068224 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902108E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:55.531137 | finish at 2025-09-10 11:48:52 + [2025-09-10 02:26:03] iteration 5927/ 11920 | consumed samples: 6069248 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897944E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:14.961046 | finish at 2025-09-10 11:48:18 + [2025-09-10 02:26:08] iteration 5928/ 11920 | consumed samples: 6070272 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894170E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:34.809767 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:26:14] iteration 5929/ 11920 | consumed samples: 6071296 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.908060E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:12.333110 | finish at 2025-09-10 11:48:26 + [2025-09-10 02:26:19] iteration 5930/ 11920 | consumed samples: 6072320 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883352E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:07.705243 | finish at 2025-09-10 11:47:27 + [2025-09-10 02:26:25] iteration 5931/ 11920 | consumed samples: 6073344 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893698E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:35.243026 | finish at 2025-09-10 11:48:00 + [2025-09-10 02:26:31] iteration 5932/ 11920 | consumed samples: 6074368 | elapsed time per iteration (ms): 5824.8 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915304E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:18.968159 | finish at 2025-09-10 12:07:50 + [2025-09-10 02:26:37] iteration 5933/ 11920 | consumed samples: 6075392 | elapsed time per iteration (ms): 5836.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883741E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:25.166277 | finish at 2025-09-10 12:09:02 + [2025-09-10 02:26:42] 
iteration 5934/ 11920 | consumed samples: 6076416 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890934E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:47.046369 | finish at 2025-09-10 11:48:29 + [2025-09-10 02:26:48] iteration 5935/ 11920 | consumed samples: 6077440 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889879E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:11.002890 | finish at 2025-09-10 11:48:59 + [2025-09-10 02:26:54] iteration 5936/ 11920 | consumed samples: 6078464 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892057E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:47.820290 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:26:59] iteration 5937/ 11920 | consumed samples: 6079488 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909774E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:27.244880 | finish at 2025-09-10 11:48:27 + [2025-09-10 02:27:05] iteration 5938/ 11920 | consumed samples: 6080512 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886598E+00 | loss scale: 1.0 | grad norm: 0.152 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:42.493173 | finish at 2025-09-10 11:47:47 + [2025-09-10 02:27:11] iteration 5939/ 11920 | consumed samples: 6081536 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876636E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:27.604617 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:27:16] iteration 5940/ 11920 | consumed samples: 6082560 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908498E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:46.559172 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:27:22] iteration 5941/ 11920 | consumed samples: 6083584 | elapsed time per iteration (ms): 5846.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893238E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:37.216598 | finish at 2025-09-10 12:09:59 + [2025-09-10 02:27:28] iteration 5942/ 11920 | consumed samples: 6084608 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888805E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:15.159437 | finish at 2025-09-10 11:48:43 + [2025-09-10 02:27:33] iteration 5943/ 11920 | consumed samples: 6085632 | elapsed time per iteration (ms): 5630.5 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892195E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:53.541718 | finish at 2025-09-10 11:48:27 + [2025-09-10 02:27:39] iteration 5944/ 11920 | consumed samples: 6086656 | elapsed time per iteration (ms): 5999.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899828E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:32.787100 | finish at 2025-09-10 12:25:12 + [2025-09-10 02:27:45] iteration 5945/ 11920 | consumed samples: 6087680 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889325E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:39.562660 | finish at 2025-09-10 11:48:24 + [2025-09-10 02:27:51] iteration 5946/ 11920 | consumed samples: 6088704 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884608E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:59.219262 | finish at 2025-09-10 11:47:50 + [2025-09-10 02:27:56] iteration 5947/ 11920 | consumed samples: 6089728 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876762E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:50.987540 
| finish at 2025-09-10 11:47:47 + [2025-09-10 02:28:02] iteration 5948/ 11920 | consumed samples: 6090752 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898767E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:07.101436 | finish at 2025-09-10 11:48:09 + [2025-09-10 02:28:07] iteration 5949/ 11920 | consumed samples: 6091776 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894143E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:15.639851 | finish at 2025-09-10 11:47:23 + [2025-09-10 02:28:13] iteration 5950/ 11920 | consumed samples: 6092800 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896973E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:06.931393 | finish at 2025-09-10 11:47:20 + [2025-09-10 02:28:19] iteration 5951/ 11920 | consumed samples: 6093824 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886404E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:35.284691 | finish at 2025-09-10 11:46:54 + [2025-09-10 02:28:24] iteration 5952/ 11920 | consumed samples: 6094848 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.871868E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:15.947617 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:28:30] iteration 5953/ 11920 | consumed samples: 6095872 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887929E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:15.442220 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:28:35] iteration 5954/ 11920 | consumed samples: 6096896 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873003E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:57.827739 | finish at 2025-09-10 11:48:33 + [2025-09-10 02:28:41] iteration 5955/ 11920 | consumed samples: 6097920 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875796E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:32.265943 | finish at 2025-09-10 11:48:13 + [2025-09-10 02:28:47] iteration 5956/ 11920 | consumed samples: 6098944 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893665E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:54.192172 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:28:52] iteration 5957/ 11920 | consumed samples: 6099968 | 
elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878205E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:02.246061 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:28:58] iteration 5958/ 11920 | consumed samples: 6100992 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902149E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:17.883071 | finish at 2025-09-10 11:48:16 + [2025-09-10 02:29:04] iteration 5959/ 11920 | consumed samples: 6102016 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899653E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:38.799085 | finish at 2025-09-10 11:47:42 + [2025-09-10 02:29:09] iteration 5960/ 11920 | consumed samples: 6103040 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888640E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:39.607401 | finish at 2025-09-10 11:47:49 +(min, max) time across ranks (ms): + save-checkpoint ................................: (3718.69, 3718.73) + [2025-09-10 02:29:19] iteration 5961/ 11920 | consumed samples: 6104064 | elapsed time per iteration (ms): 5953.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.894700E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:15.785732 | finish at 2025-09-10 12:20:35 + [2025-09-10 02:29:25] iteration 5962/ 11920 | consumed samples: 6105088 | elapsed time per iteration (ms): 6216.7 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889488E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:17:19.171504 | finish at 2025-09-10 12:46:44 + [2025-09-10 02:29:31] iteration 5963/ 11920 | consumed samples: 6106112 | elapsed time per iteration (ms): 6314.2 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879642E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:26:53.785528 | finish at 2025-09-10 12:56:25 + [2025-09-10 02:29:37] iteration 5964/ 11920 | consumed samples: 6107136 | elapsed time per iteration (ms): 5932.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882887E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:48:51.974189 | finish at 2025-09-10 12:18:29 + [2025-09-10 02:29:43] iteration 5965/ 11920 | consumed samples: 6108160 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877748E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:12.418302 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:29:49] iteration 5966/ 11920 | consumed samples: 6109184 
| elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884113E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:14.187037 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:29:54] iteration 5967/ 11920 | consumed samples: 6110208 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882854E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:03.301596 | finish at 2025-09-10 11:47:58 + [2025-09-10 02:30:00] iteration 5968/ 11920 | consumed samples: 6111232 | elapsed time per iteration (ms): 5865.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881642E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:52.456284 | finish at 2025-09-10 12:11:53 + [2025-09-10 02:30:06] iteration 5969/ 11920 | consumed samples: 6112256 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894083E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:38.911183 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:30:11] iteration 5970/ 11920 | consumed samples: 6113280 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881228E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:17:30.758016 | finish at 2025-09-10 11:47:42 + [2025-09-10 02:30:17] iteration 5971/ 11920 | consumed samples: 6114304 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892482E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:09.015050 | finish at 2025-09-10 11:47:26 + [2025-09-10 02:30:23] iteration 5972/ 11920 | consumed samples: 6115328 | elapsed time per iteration (ms): 5615.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891378E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:40.937117 | finish at 2025-09-10 11:47:04 + [2025-09-10 02:30:28] iteration 5973/ 11920 | consumed samples: 6116352 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889226E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:59.730349 | finish at 2025-09-10 11:47:28 + [2025-09-10 02:30:34] iteration 5974/ 11920 | consumed samples: 6117376 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902984E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:31.500927 | finish at 2025-09-10 11:48:05 + [2025-09-10 02:30:39] iteration 5975/ 11920 | consumed samples: 6118400 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.897515E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:28.263360 | finish at 2025-09-10 11:48:08 + [2025-09-10 02:30:45] iteration 5976/ 11920 | consumed samples: 6119424 | elapsed time per iteration (ms): 5918.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882620E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:19.361029 | finish at 2025-09-10 12:17:05 + [2025-09-10 02:30:51] iteration 5977/ 11920 | consumed samples: 6120448 | elapsed time per iteration (ms): 6009.3 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887821E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:55:13.254275 | finish at 2025-09-10 12:26:05 + [2025-09-10 02:30:57] iteration 5978/ 11920 | consumed samples: 6121472 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888720E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:00.398128 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:31:03] iteration 5979/ 11920 | consumed samples: 6122496 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901646E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:53.662383 | finish at 2025-09-10 11:48:56 + [2025-09-10 02:31:09] 
iteration 5980/ 11920 | consumed samples: 6123520 | elapsed time per iteration (ms): 5980.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888954E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:52:05.028119 | finish at 2025-09-10 12:23:14 + [2025-09-10 02:31:14] iteration 5981/ 11920 | consumed samples: 6124544 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905199E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:19.706823 | finish at 2025-09-10 11:47:34 + [2025-09-10 02:31:20] iteration 5982/ 11920 | consumed samples: 6125568 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893463E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:52.680567 | finish at 2025-09-10 11:47:13 + [2025-09-10 02:31:26] iteration 5983/ 11920 | consumed samples: 6126592 | elapsed time per iteration (ms): 5936.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893826E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:22.134561 | finish at 2025-09-10 12:18:48 + [2025-09-10 02:31:31] iteration 5984/ 11920 | consumed samples: 6127616 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896622E+00 | loss scale: 1.0 | grad norm: 0.177 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:00.878345 | finish at 2025-09-10 11:47:32 + [2025-09-10 02:31:37] iteration 5985/ 11920 | consumed samples: 6128640 | elapsed time per iteration (ms): 5977.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889636E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:17.698996 | finish at 2025-09-10 12:22:55 + [2025-09-10 02:31:43] iteration 5986/ 11920 | consumed samples: 6129664 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883857E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:46.755488 | finish at 2025-09-10 11:48:30 + [2025-09-10 02:31:49] iteration 5987/ 11920 | consumed samples: 6130688 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888343E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:57.683909 | finish at 2025-09-10 11:47:46 + [2025-09-10 02:31:55] iteration 5988/ 11920 | consumed samples: 6131712 | elapsed time per iteration (ms): 5836.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901865E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:00.713182 | finish at 2025-09-10 12:08:55 + [2025-09-10 02:32:00] iteration 5989/ 11920 | consumed samples: 6132736 | elapsed time per iteration (ms): 5636.3 | throughput 
per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885091E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:08.963691 | finish at 2025-09-10 11:49:09 + [2025-09-10 02:32:06] iteration 5990/ 11920 | consumed samples: 6133760 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884363E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:35.052564 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:32:11] iteration 5991/ 11920 | consumed samples: 6134784 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890466E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:33.601211 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:32:17] iteration 5992/ 11920 | consumed samples: 6135808 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880768E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:23.691742 | finish at 2025-09-10 11:48:41 + [2025-09-10 02:32:23] iteration 5993/ 11920 | consumed samples: 6136832 | elapsed time per iteration (ms): 5828.8 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880306E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:47.263466 
| finish at 2025-09-10 12:08:10 + [2025-09-10 02:32:28] iteration 5994/ 11920 | consumed samples: 6137856 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865008E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:15.367168 | finish at 2025-09-10 11:47:44 + [2025-09-10 02:32:34] iteration 5995/ 11920 | consumed samples: 6138880 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887656E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:45.260203 | finish at 2025-09-10 11:48:19 + [2025-09-10 02:32:40] iteration 5996/ 11920 | consumed samples: 6139904 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888803E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:41.482544 | finish at 2025-09-10 11:48:21 + [2025-09-10 02:32:45] iteration 5997/ 11920 | consumed samples: 6140928 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888793E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:31.182937 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:32:51] iteration 5998/ 11920 | consumed samples: 6141952 | elapsed time per iteration (ms): 5841.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.881969E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:36:32.516111 | finish at 2025-09-10 12:09:24 + [2025-09-10 02:32:57] iteration 5999/ 11920 | consumed samples: 6142976 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883631E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:15.375449 | finish at 2025-09-10 11:48:12 + [2025-09-10 02:33:02] iteration 6000/ 11920 | consumed samples: 6144000 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884417E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:34.251366 | finish at 2025-09-10 11:48:37 + [2025-09-10 02:33:08] iteration 6001/ 11920 | consumed samples: 6145024 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880941E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:07.199954 | finish at 2025-09-10 11:49:15 + [2025-09-10 02:33:14] iteration 6002/ 11920 | consumed samples: 6146048 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903482E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:53.181541 | finish at 2025-09-10 11:49:07 + [2025-09-10 02:33:19] iteration 6003/ 11920 | consumed samples: 6147072 | 
elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900925E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:49.609541 | finish at 2025-09-10 11:49:09 + [2025-09-10 02:33:25] iteration 6004/ 11920 | consumed samples: 6148096 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898469E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:52.881331 | finish at 2025-09-10 11:48:18 + [2025-09-10 02:33:31] iteration 6005/ 11920 | consumed samples: 6149120 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886407E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:02.187147 | finish at 2025-09-10 11:49:33 + [2025-09-10 02:33:36] iteration 6006/ 11920 | consumed samples: 6150144 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880520E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:57.564856 | finish at 2025-09-10 11:48:34 + [2025-09-10 02:33:42] iteration 6007/ 11920 | consumed samples: 6151168 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890527E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:14:18.804988 | finish at 2025-09-10 11:48:01 + [2025-09-10 02:33:48] iteration 6008/ 11920 | consumed samples: 6152192 | elapsed time per iteration (ms): 5953.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875286E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:36.869486 | finish at 2025-09-10 12:20:25 + [2025-09-10 02:33:53] iteration 6009/ 11920 | consumed samples: 6153216 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893857E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:49.857712 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:33:59] iteration 6010/ 11920 | consumed samples: 6154240 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891385E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:51.918175 | finish at 2025-09-10 11:47:51 + [2025-09-10 02:34:05] iteration 6011/ 11920 | consumed samples: 6155264 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872815E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:50.872418 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:34:10] iteration 6012/ 11920 | consumed samples: 6156288 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.884369E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:23.000173 | finish at 2025-09-10 11:47:33 + [2025-09-10 02:34:16] iteration 6013/ 11920 | consumed samples: 6157312 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877854E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:02.457411 | finish at 2025-09-10 11:47:18 + [2025-09-10 02:34:22] iteration 6014/ 11920 | consumed samples: 6158336 | elapsed time per iteration (ms): 6065.8 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880600E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:57:04.582128 | finish at 2025-09-10 12:31:27 + [2025-09-10 02:34:28] iteration 6015/ 11920 | consumed samples: 6159360 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875465E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:17.282220 | finish at 2025-09-10 11:48:45 + [2025-09-10 02:34:33] iteration 6016/ 11920 | consumed samples: 6160384 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893878E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:43.341511 | finish at 2025-09-10 11:47:17 + [2025-09-10 02:34:39] 
iteration 6017/ 11920 | consumed samples: 6161408 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884858E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:09.106953 | finish at 2025-09-10 11:49:48 + [2025-09-10 02:34:45] iteration 6018/ 11920 | consumed samples: 6162432 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899091E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:41.699152 | finish at 2025-09-10 11:48:26 + [2025-09-10 02:34:50] iteration 6019/ 11920 | consumed samples: 6163456 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896176E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:48.857243 | finish at 2025-09-10 11:47:39 + [2025-09-10 02:34:56] iteration 6020/ 11920 | consumed samples: 6164480 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885278E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:58.903842 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:35:01] iteration 6021/ 11920 | consumed samples: 6165504 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888649E+00 | loss scale: 1.0 | grad norm: 0.231 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:16.840834 | finish at 2025-09-10 11:48:18 + [2025-09-10 02:35:07] iteration 6022/ 11920 | consumed samples: 6166528 | elapsed time per iteration (ms): 5870.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888881E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:01.422443 | finish at 2025-09-10 12:12:09 + [2025-09-10 02:35:13] iteration 6023/ 11920 | consumed samples: 6167552 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897669E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:41.729527 | finish at 2025-09-10 11:47:55 + [2025-09-10 02:35:19] iteration 6024/ 11920 | consumed samples: 6168576 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896239E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:47.589329 | finish at 2025-09-10 11:48:06 + [2025-09-10 02:35:24] iteration 6025/ 11920 | consumed samples: 6169600 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872394E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:04.101566 | finish at 2025-09-10 11:48:28 + [2025-09-10 02:35:30] iteration 6026/ 11920 | consumed samples: 6170624 | elapsed time per iteration (ms): 5634.1 | throughput 
per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883852E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:27.477912 | finish at 2025-09-10 11:48:57 + [2025-09-10 02:35:35] iteration 6027/ 11920 | consumed samples: 6171648 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890462E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:18.238564 | finish at 2025-09-10 11:48:54 + [2025-09-10 02:35:41] iteration 6028/ 11920 | consumed samples: 6172672 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877304E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:01.788597 | finish at 2025-09-10 11:49:43 + [2025-09-10 02:35:47] iteration 6029/ 11920 | consumed samples: 6173696 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889914E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:20.171417 | finish at 2025-09-10 11:48:07 + [2025-09-10 02:35:52] iteration 6030/ 11920 | consumed samples: 6174720 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881426E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:45.009022 
| finish at 2025-09-10 11:48:37 + [2025-09-10 02:35:58] iteration 6031/ 11920 | consumed samples: 6175744 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885958E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:08.159305 | finish at 2025-09-10 11:48:06 + [2025-09-10 02:36:04] iteration 6032/ 11920 | consumed samples: 6176768 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884999E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:18.587830 | finish at 2025-09-10 11:48:22 + [2025-09-10 02:36:09] iteration 6033/ 11920 | consumed samples: 6177792 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890456E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:31.655410 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:36:15] iteration 6034/ 11920 | consumed samples: 6178816 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875986E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:57.425397 | finish at 2025-09-10 11:48:12 + [2025-09-10 02:36:20] iteration 6035/ 11920 | consumed samples: 6179840 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.885605E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:27.794802 | finish at 2025-09-10 11:47:48 + [2025-09-10 02:36:26] iteration 6036/ 11920 | consumed samples: 6180864 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902717E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:18.773290 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:36:32] iteration 6037/ 11920 | consumed samples: 6181888 | elapsed time per iteration (ms): 5954.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894195E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:52.303759 | finish at 2025-09-10 12:20:24 + [2025-09-10 02:36:38] iteration 6038/ 11920 | consumed samples: 6182912 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887346E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:15.526015 | finish at 2025-09-10 11:47:53 + [2025-09-10 02:36:43] iteration 6039/ 11920 | consumed samples: 6183936 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878648E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:07.766338 | finish at 2025-09-10 11:48:51 + [2025-09-10 02:36:49] iteration 6040/ 11920 | consumed samples: 6184960 | 
elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875474E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:10.813923 | finish at 2025-09-10 11:48:00 + [2025-09-10 02:36:55] iteration 6041/ 11920 | consumed samples: 6185984 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884158E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:27.212982 | finish at 2025-09-10 11:47:22 + [2025-09-10 02:37:00] iteration 6042/ 11920 | consumed samples: 6187008 | elapsed time per iteration (ms): 5642.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902171E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:47.154100 | finish at 2025-09-10 11:49:47 + [2025-09-10 02:37:06] iteration 6043/ 11920 | consumed samples: 6188032 | elapsed time per iteration (ms): 5912.4 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890831E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:39:07.358398 | finish at 2025-09-10 12:16:13 + [2025-09-10 02:37:12] iteration 6044/ 11920 | consumed samples: 6189056 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888885E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:10:27.614968 | finish at 2025-09-10 11:47:39 + [2025-09-10 02:37:17] iteration 6045/ 11920 | consumed samples: 6190080 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884236E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:05.548477 | finish at 2025-09-10 11:47:23 + [2025-09-10 02:37:23] iteration 6046/ 11920 | consumed samples: 6191104 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879956E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:14.911347 | finish at 2025-09-10 11:47:38 + [2025-09-10 02:37:29] iteration 6047/ 11920 | consumed samples: 6192128 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889353E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:13.950804 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:37:34] iteration 6048/ 11920 | consumed samples: 6193152 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897157E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:01.175526 | finish at 2025-09-10 11:47:35 + [2025-09-10 02:37:40] iteration 6049/ 11920 | consumed samples: 6194176 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.883871E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:31.197408 | finish at 2025-09-10 11:48:11 + [2025-09-10 02:37:45] iteration 6050/ 11920 | consumed samples: 6195200 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888649E+00 | loss scale: 1.0 | grad norm: 0.298 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:47.214777 | finish at 2025-09-10 11:48:33 + [2025-09-10 02:37:51] iteration 6051/ 11920 | consumed samples: 6196224 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884754E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:50.028507 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:37:57] iteration 6052/ 11920 | consumed samples: 6197248 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894358E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:10.310672 | finish at 2025-09-10 11:48:07 + [2025-09-10 02:38:03] iteration 6053/ 11920 | consumed samples: 6198272 | elapsed time per iteration (ms): 6001.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886595E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:46:51.113194 | finish at 2025-09-10 12:24:54 + [2025-09-10 02:38:08] 
iteration 6054/ 11920 | consumed samples: 6199296 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890086E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:54.559137 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:38:14] iteration 6055/ 11920 | consumed samples: 6200320 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884336E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:48.286328 | finish at 2025-09-10 11:49:02 + [2025-09-10 02:38:20] iteration 6056/ 11920 | consumed samples: 6201344 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897751E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:24.948643 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:38:25] iteration 6057/ 11920 | consumed samples: 6202368 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891226E+00 | loss scale: 1.0 | grad norm: 0.346 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:19.026525 | finish at 2025-09-10 11:47:44 + [2025-09-10 02:38:31] iteration 6058/ 11920 | consumed samples: 6203392 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882308E+00 | loss scale: 1.0 | grad norm: 0.245 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:08.707629 | finish at 2025-09-10 11:47:40 + [2025-09-10 02:38:36] iteration 6059/ 11920 | consumed samples: 6204416 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897584E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:27.417930 | finish at 2025-09-10 11:48:04 + [2025-09-10 02:38:42] iteration 6060/ 11920 | consumed samples: 6205440 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886851E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:55.190115 | finish at 2025-09-10 11:48:37 + [2025-09-10 02:38:48] iteration 6061/ 11920 | consumed samples: 6206464 | elapsed time per iteration (ms): 6189.2 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888359E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:04:22.469275 | finish at 2025-09-10 12:43:11 + [2025-09-10 02:38:54] iteration 6062/ 11920 | consumed samples: 6207488 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900142E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:29.410718 | finish at 2025-09-10 11:48:23 + [2025-09-10 02:39:00] iteration 6063/ 11920 | consumed samples: 6208512 | elapsed time per iteration (ms): 5630.2 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896780E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:36.153482 | finish at 2025-09-10 11:48:36 + [2025-09-10 02:39:05] iteration 6064/ 11920 | consumed samples: 6209536 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885849E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:01.022003 | finish at 2025-09-10 11:48:06 + [2025-09-10 02:39:11] iteration 6065/ 11920 | consumed samples: 6210560 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874218E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:06.511309 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:39:16] iteration 6066/ 11920 | consumed samples: 6211584 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895945E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:24.717401 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:39:22] iteration 6067/ 11920 | consumed samples: 6212608 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907081E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:11.032883 
| finish at 2025-09-10 11:48:33 + [2025-09-10 02:39:28] iteration 6068/ 11920 | consumed samples: 6213632 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880595E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:44.284982 | finish at 2025-09-10 11:48:12 + [2025-09-10 02:39:33] iteration 6069/ 11920 | consumed samples: 6214656 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878279E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:13.218442 | finish at 2025-09-10 11:47:46 + [2025-09-10 02:39:40] iteration 6070/ 11920 | consumed samples: 6215680 | elapsed time per iteration (ms): 6351.2 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878261E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:19:14.305816 | finish at 2025-09-10 12:58:54 + [2025-09-10 02:39:45] iteration 6071/ 11920 | consumed samples: 6216704 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894783E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:17.435954 | finish at 2025-09-10 11:49:03 + [2025-09-10 02:39:51] iteration 6072/ 11920 | consumed samples: 6217728 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.887397E+00 | loss scale: 1.0 | grad norm: 0.326 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:43.589544 | finish at 2025-09-10 11:48:34 + [2025-09-10 02:39:57] iteration 6073/ 11920 | consumed samples: 6218752 | elapsed time per iteration (ms): 5853.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897643E+00 | loss scale: 1.0 | grad norm: 0.475 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:27.164759 | finish at 2025-09-10 12:10:24 + [2025-09-10 02:40:03] iteration 6074/ 11920 | consumed samples: 6219776 | elapsed time per iteration (ms): 5832.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893956E+00 | loss scale: 1.0 | grad norm: 0.440 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:14.474220 | finish at 2025-09-10 12:08:17 + [2025-09-10 02:40:08] iteration 6075/ 11920 | consumed samples: 6220800 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886459E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:15.922111 | finish at 2025-09-10 11:48:24 + [2025-09-10 02:40:14] iteration 6076/ 11920 | consumed samples: 6221824 | elapsed time per iteration (ms): 5990.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888736E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:26.060257 | finish at 2025-09-10 12:23:40 + [2025-09-10 02:40:20] iteration 6077/ 11920 | consumed samples: 6222848 | 
elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890203E+00 | loss scale: 1.0 | grad norm: 0.324 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:49.651353 | finish at 2025-09-10 11:49:09 + [2025-09-10 02:40:25] iteration 6078/ 11920 | consumed samples: 6223872 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891175E+00 | loss scale: 1.0 | grad norm: 0.465 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:41.142751 | finish at 2025-09-10 11:48:07 + [2025-09-10 02:40:31] iteration 6079/ 11920 | consumed samples: 6224896 | elapsed time per iteration (ms): 5968.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889019E+00 | loss scale: 1.0 | grad norm: 0.386 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:41:01.694574 | finish at 2025-09-10 12:21:33 + [2025-09-10 02:40:37] iteration 6080/ 11920 | consumed samples: 6225920 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886259E+00 | loss scale: 1.0 | grad norm: 0.955 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:25.158520 | finish at 2025-09-10 11:49:02 + [2025-09-10 02:40:43] iteration 6081/ 11920 | consumed samples: 6226944 | elapsed time per iteration (ms): 5873.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920549E+00 | loss scale: 1.0 | grad norm: 1.952 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:31:32.712350 | finish at 2025-09-10 12:12:16 + [2025-09-10 02:40:49] iteration 6082/ 11920 | consumed samples: 6227968 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914607E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:34.151339 | finish at 2025-09-10 11:49:23 + [2025-09-10 02:40:55] iteration 6083/ 11920 | consumed samples: 6228992 | elapsed time per iteration (ms): 6284.5 | throughput per GPU (TFLOP/s/GPU): 71.8 | MFU 7.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930061E+00 | loss scale: 1.0 | grad norm: 0.884 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:11:22.897192 | finish at 2025-09-10 12:52:18 + [2025-09-10 02:41:01] iteration 6084/ 11920 | consumed samples: 6230016 | elapsed time per iteration (ms): 5661.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936292E+00 | loss scale: 1.0 | grad norm: 2.364 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:39.164198 | finish at 2025-09-10 11:51:40 + [2025-09-10 02:41:06] iteration 6085/ 11920 | consumed samples: 6231040 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928982E+00 | loss scale: 1.0 | grad norm: 1.011 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:54.418191 | finish at 2025-09-10 11:49:01 + [2025-09-10 02:41:12] iteration 6086/ 11920 | consumed samples: 6232064 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949690E+00 | loss scale: 1.0 | grad norm: 2.691 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:19.941107 | finish at 2025-09-10 11:49:32 + [2025-09-10 02:41:17] iteration 6087/ 11920 | consumed samples: 6233088 | elapsed time per iteration (ms): 5651.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923548E+00 | loss scale: 1.0 | grad norm: 1.019 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:24.004815 | finish at 2025-09-10 11:50:41 + [2025-09-10 02:41:23] iteration 6088/ 11920 | consumed samples: 6234112 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920243E+00 | loss scale: 1.0 | grad norm: 0.606 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:03.367556 | finish at 2025-09-10 11:49:26 + [2025-09-10 02:41:29] iteration 6089/ 11920 | consumed samples: 6235136 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915462E+00 | loss scale: 1.0 | grad norm: 0.933 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:52.058416 | finish at 2025-09-10 11:49:21 + [2025-09-10 02:41:34] iteration 6090/ 11920 | consumed samples: 6236160 | elapsed time per iteration (ms): 5648.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924261E+00 | loss scale: 1.0 | grad norm: 0.747 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:50.633872 | finish at 2025-09-10 11:50:25 + [2025-09-10 
02:41:40] iteration 6091/ 11920 | consumed samples: 6237184 | elapsed time per iteration (ms): 5640.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915174E+00 | loss scale: 1.0 | grad norm: 0.795 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:58.623603 | finish at 2025-09-10 11:49:39 + [2025-09-10 02:41:46] iteration 6092/ 11920 | consumed samples: 6238208 | elapsed time per iteration (ms): 5868.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918565E+00 | loss scale: 1.0 | grad norm: 1.723 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:01.233074 | finish at 2025-09-10 12:11:47 + [2025-09-10 02:41:52] iteration 6093/ 11920 | consumed samples: 6239232 | elapsed time per iteration (ms): 5956.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935849E+00 | loss scale: 1.0 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:29.063457 | finish at 2025-09-10 12:20:21 + [2025-09-10 02:41:58] iteration 6094/ 11920 | consumed samples: 6240256 | elapsed time per iteration (ms): 6002.3 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930127E+00 | loss scale: 1.0 | grad norm: 3.846 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:49.131858 | finish at 2025-09-10 12:24:47 + [2025-09-10 02:42:03] iteration 6095/ 11920 | consumed samples: 6241280 | elapsed time per iteration (ms): 5643.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.926090E+00 | loss scale: 1.0 | grad norm: 0.568 
| num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:56.001722 | finish at 2025-09-10 11:49:59 + [2025-09-10 02:42:09] iteration 6096/ 11920 | consumed samples: 6242304 | elapsed time per iteration (ms): 5644.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937101E+00 | loss scale: 1.0 | grad norm: 1.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:52.641937 | finish at 2025-09-10 11:50:02 + [2025-09-10 02:42:15] iteration 6097/ 11920 | consumed samples: 6243328 | elapsed time per iteration (ms): 5641.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975708E+00 | loss scale: 1.0 | grad norm: 2.111 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:30.025490 | finish at 2025-09-10 11:49:45 + [2025-09-10 02:42:20] iteration 6098/ 11920 | consumed samples: 6244352 | elapsed time per iteration (ms): 5648.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.079475E+00 | loss scale: 1.0 | grad norm: 28.148 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:06.738334 | finish at 2025-09-10 11:50:27 + [2025-09-10 02:42:26] iteration 6099/ 11920 | consumed samples: 6245376 | elapsed time per iteration (ms): 5662.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010037E+00 | loss scale: 1.0 | grad norm: 3.703 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:21.693675 | finish at 2025-09-10 11:51:48 + [2025-09-10 02:42:32] iteration 6100/ 11920 | consumed samples: 6246400 | elapsed time per iteration (ms): 5953.2 | 
throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993254E+00 | loss scale: 1.0 | grad norm: 0.805 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:27.560763 | finish at 2025-09-10 12:20:00 + [2025-09-10 02:42:38] iteration 6101/ 11920 | consumed samples: 6247424 | elapsed time per iteration (ms): 5649.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972055E+00 | loss scale: 1.0 | grad norm: 1.833 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:55.599710 | finish at 2025-09-10 11:50:33 + [2025-09-10 02:42:43] iteration 6102/ 11920 | consumed samples: 6248448 | elapsed time per iteration (ms): 5672.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.082627E+00 | loss scale: 1.0 | grad norm: 8.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:00.126995 | finish at 2025-09-10 11:52:43 + [2025-09-10 02:42:49] iteration 6103/ 11920 | consumed samples: 6249472 | elapsed time per iteration (ms): 5784.2 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.574707E+00 | loss scale: 1.0 | grad norm: 86.268 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:46.828428 | finish at 2025-09-10 12:03:36 + [2025-09-10 02:42:55] iteration 6104/ 11920 | consumed samples: 6250496 | elapsed time per iteration (ms): 5655.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.143564E+00 | loss scale: 1.0 | grad norm: 1.426 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
9:08:11.258896 | finish at 2025-09-10 11:51:06 + [2025-09-10 02:43:00] iteration 6105/ 11920 | consumed samples: 6251520 | elapsed time per iteration (ms): 5654.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.049000E+00 | loss scale: 1.0 | grad norm: 0.663 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:58.521838 | finish at 2025-09-10 11:50:59 + [2025-09-10 02:43:06] iteration 6106/ 11920 | consumed samples: 6252544 | elapsed time per iteration (ms): 5665.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.067147E+00 | loss scale: 1.0 | grad norm: 5.048 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:59.331619 | finish at 2025-09-10 11:52:05 + [2025-09-10 02:43:12] iteration 6107/ 11920 | consumed samples: 6253568 | elapsed time per iteration (ms): 5699.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.211339E+00 | loss scale: 1.0 | grad norm: 5.626 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:08.319575 | finish at 2025-09-10 11:55:20 + [2025-09-10 02:43:18] iteration 6108/ 11920 | consumed samples: 6254592 | elapsed time per iteration (ms): 5771.6 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.319341E+00 | loss scale: 1.0 | grad norm: 81.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:04.424234 | finish at 2025-09-10 12:02:22 + [2025-09-10 02:43:23] iteration 6109/ 11920 | consumed samples: 6255616 | elapsed time per iteration (ms): 5685.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 
1024 | lm loss: 3.240247E+00 | loss scale: 1.0 | grad norm: 0.935 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:35.782663 | finish at 2025-09-10 11:53:59 + [2025-09-10 02:43:29] iteration 6110/ 11920 | consumed samples: 6256640 | elapsed time per iteration (ms): 5809.7 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.214115E+00 | loss scale: 1.0 | grad norm: 63.535 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:34.131281 | finish at 2025-09-10 12:06:03 + [2025-09-10 02:43:35] iteration 6111/ 11920 | consumed samples: 6257664 | elapsed time per iteration (ms): 5698.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.337505E+00 | loss scale: 1.0 | grad norm: 2.281 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:43.012595 | finish at 2025-09-10 11:55:18 + [2025-09-10 02:43:40] iteration 6112/ 11920 | consumed samples: 6258688 | elapsed time per iteration (ms): 5709.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.383989E+00 | loss scale: 1.0 | grad norm: 5.398 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:42.719215 | finish at 2025-09-10 11:56:23 + [2025-09-10 02:43:46] iteration 6113/ 11920 | consumed samples: 6259712 | elapsed time per iteration (ms): 5686.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.239930E+00 | loss scale: 1.0 | grad norm: 1.569 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:22.423842 | finish at 2025-09-10 11:54:09 + [2025-09-10 02:43:52] iteration 6114/ 11920 | consumed 
samples: 6260736 | elapsed time per iteration (ms): 5684.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144969E+00 | loss scale: 1.0 | grad norm: 0.408 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:03.250356 | finish at 2025-09-10 11:53:55 + [2025-09-10 02:43:58] iteration 6115/ 11920 | consumed samples: 6261760 | elapsed time per iteration (ms): 5682.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.161455E+00 | loss scale: 1.0 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:49.308958 | finish at 2025-09-10 11:53:47 + [2025-09-10 02:44:03] iteration 6116/ 11920 | consumed samples: 6262784 | elapsed time per iteration (ms): 5721.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.625835E+00 | loss scale: 1.0 | grad norm: 12.590 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:29.486851 | finish at 2025-09-10 11:57:33 + [2025-09-10 02:44:09] iteration 6117/ 11920 | consumed samples: 6263808 | elapsed time per iteration (ms): 5704.5 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.319118E+00 | loss scale: 1.0 | grad norm: 5.978 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:43.058311 | finish at 2025-09-10 11:55:52 + [2025-09-10 02:44:15] iteration 6118/ 11920 | consumed samples: 6264832 | elapsed time per iteration (ms): 5739.8 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.713917E+00 | loss scale: 1.0 | grad norm: 8.367 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:02.327763 | finish at 2025-09-10 11:59:17 + [2025-09-10 02:44:20] iteration 6119/ 11920 | consumed samples: 6265856 | elapsed time per iteration (ms): 5757.1 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.811442E+00 | loss scale: 1.0 | grad norm: 12.361 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:37.159001 | finish at 2025-09-10 12:00:58 + [2025-09-10 02:44:26] iteration 6120/ 11920 | consumed samples: 6266880 | elapsed time per iteration (ms): 6029.9 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.452429E+00 | loss scale: 1.0 | grad norm: 0.815 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:53.618174 | finish at 2025-09-10 12:27:20 + [2025-09-10 02:44:33] iteration 6121/ 11920 | consumed samples: 6267904 | elapsed time per iteration (ms): 6297.2 | throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.301114E+00 | loss scale: 1.0 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 10:08:37.715608 | finish at 2025-09-10 12:53:10 + [2025-09-10 02:44:38] iteration 6122/ 11920 | consumed samples: 6268928 | elapsed time per iteration (ms): 5708.3 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.253606E+00 | loss scale: 1.0 | grad norm: 1.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:36.706089 | finish at 2025-09-10 11:56:15 + [2025-09-10 02:44:44] iteration 6123/ 11920 | consumed samples: 6269952 | elapsed time per iteration (ms): 5721.6 | throughput per GPU (TFLOP/s/GPU): 78.9 | 
MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.363042E+00 | loss scale: 1.0 | grad norm: 4.105 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:48.134872 | finish at 2025-09-10 11:57:32 + [2025-09-10 02:44:50] iteration 6124/ 11920 | consumed samples: 6270976 | elapsed time per iteration (ms): 5704.2 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.290515E+00 | loss scale: 1.0 | grad norm: 0.761 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:01.582063 | finish at 2025-09-10 11:55:51 + [2025-09-10 02:44:56] iteration 6125/ 11920 | consumed samples: 6272000 | elapsed time per iteration (ms): 6048.2 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.588762E+00 | loss scale: 1.0 | grad norm: 15.926 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:44:09.247911 | finish at 2025-09-10 12:29:05 + [2025-09-10 02:45:02] iteration 6126/ 11920 | consumed samples: 6273024 | elapsed time per iteration (ms): 5995.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.405668E+00 | loss scale: 1.0 | grad norm: 1.180 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:58.333639 | finish at 2025-09-10 12:24:00 + [2025-09-10 02:45:08] iteration 6127/ 11920 | consumed samples: 6274048 | elapsed time per iteration (ms): 5730.9 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.356644E+00 | loss scale: 1.0 | grad norm: 1.570 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:19.278011 | finish at 2025-09-10 
11:58:27 + [2025-09-10 02:45:13] iteration 6128/ 11920 | consumed samples: 6275072 | elapsed time per iteration (ms): 5686.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.296104E+00 | loss scale: 1.0 | grad norm: 0.756 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:58.734123 | finish at 2025-09-10 11:54:12 + [2025-09-10 02:45:19] iteration 6129/ 11920 | consumed samples: 6276096 | elapsed time per iteration (ms): 5695.6 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.263097E+00 | loss scale: 1.0 | grad norm: 0.760 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:43.382710 | finish at 2025-09-10 11:55:02 + [2025-09-10 02:45:25] iteration 6130/ 11920 | consumed samples: 6277120 | elapsed time per iteration (ms): 5899.0 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.255255E+00 | loss scale: 1.0 | grad norm: 0.726 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:29:15.241342 | finish at 2025-09-10 12:14:40 + [2025-09-10 02:45:31] iteration 6131/ 11920 | consumed samples: 6278144 | elapsed time per iteration (ms): 5680.0 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222535E+00 | loss scale: 1.0 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:01.383746 | finish at 2025-09-10 11:53:32 + [2025-09-10 02:45:36] iteration 6132/ 11920 | consumed samples: 6279168 | elapsed time per iteration (ms): 5670.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.214687E+00 | loss scale: 
1.0 | grad norm: 1.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:00.776954 | finish at 2025-09-10 11:52:37 + [2025-09-10 02:45:42] iteration 6133/ 11920 | consumed samples: 6280192 | elapsed time per iteration (ms): 5689.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.225211E+00 | loss scale: 1.0 | grad norm: 1.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:45.975916 | finish at 2025-09-10 11:54:28 + [2025-09-10 02:45:48] iteration 6134/ 11920 | consumed samples: 6281216 | elapsed time per iteration (ms): 5682.2 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.206146E+00 | loss scale: 1.0 | grad norm: 1.450 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:57.396550 | finish at 2025-09-10 11:53:45 + [2025-09-10 02:45:54] iteration 6135/ 11920 | consumed samples: 6282240 | elapsed time per iteration (ms): 5989.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.268653E+00 | loss scale: 1.0 | grad norm: 2.434 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:37:27.444049 | finish at 2025-09-10 12:23:21 + [2025-09-10 02:45:59] iteration 6136/ 11920 | consumed samples: 6283264 | elapsed time per iteration (ms): 5680.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.234206E+00 | loss scale: 1.0 | grad norm: 1.622 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:37.886255 | finish at 2025-09-10 11:53:37 + [2025-09-10 02:46:05] iteration 6137/ 11920 | consumed samples: 6284288 | elapsed time per iteration 
(ms): 5666.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.185838E+00 | loss scale: 1.0 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:06:09.947736 | finish at 2025-09-10 11:52:15 + [2025-09-10 02:46:11] iteration 6138/ 11920 | consumed samples: 6285312 | elapsed time per iteration (ms): 5663.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.175374E+00 | loss scale: 1.0 | grad norm: 0.572 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:05:48.157776 | finish at 2025-09-10 11:51:59 + [2025-09-10 02:46:16] iteration 6139/ 11920 | consumed samples: 6286336 | elapsed time per iteration (ms): 5682.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.143025E+00 | loss scale: 1.0 | grad norm: 0.631 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:33.081689 | finish at 2025-09-10 11:53:49 + [2025-09-10 02:46:22] iteration 6140/ 11920 | consumed samples: 6287360 | elapsed time per iteration (ms): 5679.1 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.144676E+00 | loss scale: 1.0 | grad norm: 1.026 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:05.217505 | finish at 2025-09-10 11:53:27 + [2025-09-10 02:46:28] iteration 6141/ 11920 | consumed samples: 6288384 | elapsed time per iteration (ms): 5858.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.114113E+00 | loss scale: 1.0 | grad norm: 0.483 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 9:24:18.273496 | finish at 2025-09-10 12:10:46 + [2025-09-10 02:46:34] iteration 6142/ 11920 | consumed samples: 6289408 | elapsed time per iteration (ms): 6050.6 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.119018E+00 | loss scale: 1.0 | grad norm: 0.897 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:42:40.212811 | finish at 2025-09-10 12:29:14 + [2025-09-10 02:46:40] iteration 6143/ 11920 | consumed samples: 6290432 | elapsed time per iteration (ms): 5706.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.187118E+00 | loss scale: 1.0 | grad norm: 1.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:23.325615 | finish at 2025-09-10 11:56:03 + [2025-09-10 02:46:45] iteration 6144/ 11920 | consumed samples: 6291456 | elapsed time per iteration (ms): 5696.3 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.155473E+00 | loss scale: 1.0 | grad norm: 1.524 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:21.868908 | finish at 2025-09-10 11:55:07 + [2025-09-10 02:46:51] iteration 6145/ 11920 | consumed samples: 6292480 | elapsed time per iteration (ms): 5733.6 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.369349E+00 | loss scale: 1.0 | grad norm: 5.286 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:51.254418 | finish at 2025-09-10 11:58:42 + [2025-09-10 02:46:57] iteration 6146/ 11920 | consumed samples: 6293504 | elapsed time per iteration (ms): 5682.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 3.165153E+00 | loss scale: 1.0 | grad norm: 0.562 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:06:49.536023 | finish at 2025-09-10 11:53:46 + [2025-09-10 02:47:03] iteration 6147/ 11920 | consumed samples: 6294528 | elapsed time per iteration (ms): 5887.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.193103E+00 | loss scale: 1.0 | grad norm: 1.952 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:26.247462 | finish at 2025-09-10 12:13:29 + [2025-09-10 02:47:08] iteration 6148/ 11920 | consumed samples: 6295552 | elapsed time per iteration (ms): 5680.3 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.214998E+00 | loss scale: 1.0 | grad norm: 1.800 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:06:26.642043 | finish at 2025-09-10 11:53:35 + [2025-09-10 02:47:14] iteration 6149/ 11920 | consumed samples: 6296576 | elapsed time per iteration (ms): 5664.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.150190E+00 | loss scale: 1.0 | grad norm: 0.614 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:51.964907 | finish at 2025-09-10 11:52:06 + [2025-09-10 02:47:20] iteration 6150/ 11920 | consumed samples: 6297600 | elapsed time per iteration (ms): 5661.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.194606E+00 | loss scale: 1.0 | grad norm: 1.862 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:26.150522 | finish at 2025-09-10 11:51:46 + [2025-09-10 02:47:26] iteration 6151/ 11920 | 
consumed samples: 6298624 | elapsed time per iteration (ms): 5981.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.201031E+00 | loss scale: 1.0 | grad norm: 1.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:35:08.891872 | finish at 2025-09-10 12:22:35 + [2025-09-10 02:47:31] iteration 6152/ 11920 | consumed samples: 6299648 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.137528E+00 | loss scale: 1.0 | grad norm: 0.368 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:37.939541 | finish at 2025-09-10 11:50:09 + [2025-09-10 02:47:37] iteration 6153/ 11920 | consumed samples: 6300672 | elapsed time per iteration (ms): 5647.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.094027E+00 | loss scale: 1.0 | grad norm: 0.346 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:48.288495 | finish at 2025-09-10 11:50:25 + [2025-09-10 02:47:43] iteration 6154/ 11920 | consumed samples: 6301696 | elapsed time per iteration (ms): 5653.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.098108E+00 | loss scale: 1.0 | grad norm: 0.385 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:03:17.075166 | finish at 2025-09-10 11:51:00 + [2025-09-10 02:47:48] iteration 6155/ 11920 | consumed samples: 6302720 | elapsed time per iteration (ms): 5664.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089793E+00 | loss scale: 1.0 | grad norm: 0.999 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:15.762769 | finish at 2025-09-10 11:52:04 + [2025-09-10 02:47:54] iteration 6156/ 11920 | consumed samples: 6303744 | elapsed time per iteration (ms): 5676.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.128735E+00 | loss scale: 1.0 | grad norm: 1.316 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:05:20.283708 | finish at 2025-09-10 11:53:14 + [2025-09-10 02:48:00] iteration 6157/ 11920 | consumed samples: 6304768 | elapsed time per iteration (ms): 5670.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.090266E+00 | loss scale: 1.0 | grad norm: 0.446 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:39.468209 | finish at 2025-09-10 11:52:39 + [2025-09-10 02:48:05] iteration 6158/ 11920 | consumed samples: 6305792 | elapsed time per iteration (ms): 5650.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.080190E+00 | loss scale: 1.0 | grad norm: 0.883 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:39.332558 | finish at 2025-09-10 11:50:45 + [2025-09-10 02:48:11] iteration 6159/ 11920 | consumed samples: 6306816 | elapsed time per iteration (ms): 5651.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.091148E+00 | loss scale: 1.0 | grad norm: 1.496 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:37.393135 | finish at 2025-09-10 11:50:48 + [2025-09-10 02:48:17] iteration 6160/ 11920 | consumed samples: 6307840 | elapsed time per iteration (ms): 5886.6 | throughput per GPU (TFLOP/s/GPU): 
76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.085335E+00 | loss scale: 1.0 | grad norm: 0.732 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:25:06.699371 | finish at 2025-09-10 12:13:24 + [2025-09-10 02:48:22] iteration 6161/ 11920 | consumed samples: 6308864 | elapsed time per iteration (ms): 5660.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.099086E+00 | loss scale: 1.0 | grad norm: 1.985 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:03:20.286206 | finish at 2025-09-10 11:51:43 + [2025-09-10 02:48:28] iteration 6162/ 11920 | consumed samples: 6309888 | elapsed time per iteration (ms): 5655.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087770E+00 | loss scale: 1.0 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:45.523163 | finish at 2025-09-10 11:51:14 + [2025-09-10 02:48:34] iteration 6163/ 11920 | consumed samples: 6310912 | elapsed time per iteration (ms): 5648.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.057478E+00 | loss scale: 1.0 | grad norm: 0.367 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:00.032569 | finish at 2025-09-10 11:50:34 + [2025-09-10 02:48:39] iteration 6164/ 11920 | consumed samples: 6311936 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035487E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:04.931573 | finish at 2025-09-10 
11:50:44 + [2025-09-10 02:48:45] iteration 6165/ 11920 | consumed samples: 6312960 | elapsed time per iteration (ms): 5649.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017428E+00 | loss scale: 1.0 | grad norm: 0.629 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:01:52.947351 | finish at 2025-09-10 11:50:38 + [2025-09-10 02:48:51] iteration 6166/ 11920 | consumed samples: 6313984 | elapsed time per iteration (ms): 5657.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.046869E+00 | loss scale: 1.0 | grad norm: 1.371 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:31.038238 | finish at 2025-09-10 11:51:22 + [2025-09-10 02:48:56] iteration 6167/ 11920 | consumed samples: 6315008 | elapsed time per iteration (ms): 5654.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.022135E+00 | loss scale: 1.0 | grad norm: 0.377 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:11.079220 | finish at 2025-09-10 11:51:07 + [2025-09-10 02:49:02] iteration 6168/ 11920 | consumed samples: 6316032 | elapsed time per iteration (ms): 5669.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.029613E+00 | loss scale: 1.0 | grad norm: 0.454 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:03:29.386185 | finish at 2025-09-10 11:52:31 + [2025-09-10 02:49:08] iteration 6169/ 11920 | consumed samples: 6317056 | elapsed time per iteration (ms): 5652.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.018476E+00 | loss scale: 
1.0 | grad norm: 0.329 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:01:45.839127 | finish at 2025-09-10 11:50:54 + [2025-09-10 02:49:13] iteration 6170/ 11920 | consumed samples: 6318080 | elapsed time per iteration (ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999650E+00 | loss scale: 1.0 | grad norm: 0.412 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:01:01.857736 | finish at 2025-09-10 11:50:15 + [2025-09-10 02:49:19] iteration 6171/ 11920 | consumed samples: 6319104 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.011604E+00 | loss scale: 1.0 | grad norm: 0.520 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:00:35.886554 | finish at 2025-09-10 11:49:55 + [2025-09-10 02:49:25] iteration 6172/ 11920 | consumed samples: 6320128 | elapsed time per iteration (ms): 5898.9 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997916E+00 | loss scale: 1.0 | grad norm: 0.504 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:25:06.997982 | finish at 2025-09-10 12:14:32 + [2025-09-10 02:49:31] iteration 6173/ 11920 | consumed samples: 6321152 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.012982E+00 | loss scale: 1.0 | grad norm: 0.453 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:59:47.637516 | finish at 2025-09-10 11:49:18 + [2025-09-10 02:49:37] iteration 6174/ 11920 | consumed samples: 6322176 | elapsed time per iteration 
(ms): 5970.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980908E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:31:46.443638 | finish at 2025-09-10 12:21:23 + [2025-09-10 02:49:42] iteration 6175/ 11920 | consumed samples: 6323200 | elapsed time per iteration (ms): 5644.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978774E+00 | loss scale: 1.0 | grad norm: 0.835 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:00:24.977546 | finish at 2025-09-10 11:50:07 + [2025-09-10 02:49:48] iteration 6176/ 11920 | consumed samples: 6324224 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001881E+00 | loss scale: 1.0 | grad norm: 0.860 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:00:26.973820 | finish at 2025-09-10 11:50:15 + [2025-09-10 02:49:53] iteration 6177/ 11920 | consumed samples: 6325248 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977857E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:59:28.861997 | finish at 2025-09-10 11:49:22 + [2025-09-10 02:49:59] iteration 6178/ 11920 | consumed samples: 6326272 | elapsed time per iteration (ms): 5644.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.988091E+00 | loss scale: 1.0 | grad norm: 0.462 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 9:00:11.059979 | finish at 2025-09-10 11:50:10 + [2025-09-10 02:50:05] iteration 6179/ 11920 | consumed samples: 6327296 | elapsed time per iteration (ms): 5652.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974932E+00 | loss scale: 1.0 | grad norm: 0.640 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:00:51.467386 | finish at 2025-09-10 11:50:56 + [2025-09-10 02:50:10] iteration 6180/ 11920 | consumed samples: 6328320 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.979075E+00 | loss scale: 1.0 | grad norm: 0.442 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:54.217257 | finish at 2025-09-10 11:49:05 + [2025-09-10 02:50:16] iteration 6181/ 11920 | consumed samples: 6329344 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961057E+00 | loss scale: 1.0 | grad norm: 0.298 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:34.341648 | finish at 2025-09-10 11:48:50 + [2025-09-10 02:50:22] iteration 6182/ 11920 | consumed samples: 6330368 | elapsed time per iteration (ms): 5847.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960157E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:11.576502 | finish at 2025-09-10 12:09:33 + [2025-09-10 02:50:27] iteration 6183/ 11920 | consumed samples: 6331392 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.973500E+00 | loss scale: 1.0 | grad norm: 0.585 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:23.344321 | finish at 2025-09-10 11:48:51 + [2025-09-10 02:50:33] iteration 6184/ 11920 | consumed samples: 6332416 | elapsed time per iteration (ms): 5869.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961017E+00 | loss scale: 1.0 | grad norm: 0.591 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:09.388029 | finish at 2025-09-10 12:11:43 + [2025-09-10 02:50:39] iteration 6185/ 11920 | consumed samples: 6333440 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.978157E+00 | loss scale: 1.0 | grad norm: 0.346 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:32.434430 | finish at 2025-09-10 11:48:11 + [2025-09-10 02:50:45] iteration 6186/ 11920 | consumed samples: 6334464 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976298E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:14.672536 | finish at 2025-09-10 11:48:59 + [2025-09-10 02:50:50] iteration 6187/ 11920 | consumed samples: 6335488 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944590E+00 | loss scale: 1.0 | grad norm: 0.321 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:11.165857 | finish at 2025-09-10 11:49:01 + [2025-09-10 02:50:56] iteration 6188/ 11920 | 
consumed samples: 6336512 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968551E+00 | loss scale: 1.0 | grad norm: 0.596 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:13.932567 | finish at 2025-09-10 11:49:10 + [2025-09-10 02:51:02] iteration 6189/ 11920 | consumed samples: 6337536 | elapsed time per iteration (ms): 5885.9 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960462E+00 | loss scale: 1.0 | grad norm: 0.567 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:22:12.232388 | finish at 2025-09-10 12:13:14 + [2025-09-10 02:51:07] iteration 6190/ 11920 | consumed samples: 6338560 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.959719E+00 | loss scale: 1.0 | grad norm: 0.354 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:21.431830 | finish at 2025-09-10 11:48:29 + [2025-09-10 02:51:13] iteration 6191/ 11920 | consumed samples: 6339584 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949875E+00 | loss scale: 1.0 | grad norm: 0.391 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:49.594797 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:51:19] iteration 6192/ 11920 | consumed samples: 6340608 | elapsed time per iteration (ms): 5858.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.961555E+00 | loss scale: 1.0 | grad norm: 0.591 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:16.978630 | finish at 2025-09-10 12:10:36 + [2025-09-10 02:51:24] iteration 6193/ 11920 | consumed samples: 6341632 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949220E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:57.677959 | finish at 2025-09-10 11:48:22 + [2025-09-10 02:51:30] iteration 6194/ 11920 | consumed samples: 6342656 | elapsed time per iteration (ms): 5905.5 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.936569E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:34.795701 | finish at 2025-09-10 12:15:05 + [2025-09-10 02:51:36] iteration 6195/ 11920 | consumed samples: 6343680 | elapsed time per iteration (ms): 5932.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939380E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:04.073700 | finish at 2025-09-10 12:17:40 + [2025-09-10 02:51:42] iteration 6196/ 11920 | consumed samples: 6344704 | elapsed time per iteration (ms): 5952.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945965E+00 | loss scale: 1.0 | grad norm: 0.448 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:27:54.051215 | finish at 2025-09-10 12:19:36 + [2025-09-10 02:51:48] iteration 6197/ 11920 | consumed samples: 6345728 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951626E+00 | loss scale: 1.0 | grad norm: 0.697 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:58.770059 | finish at 2025-09-10 11:48:47 + [2025-09-10 02:51:54] iteration 6198/ 11920 | consumed samples: 6346752 | elapsed time per iteration (ms): 5639.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942101E+00 | loss scale: 1.0 | grad norm: 0.457 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:49.747765 | finish at 2025-09-10 11:49:43 + [2025-09-10 02:51:59] iteration 6199/ 11920 | consumed samples: 6347776 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939152E+00 | loss scale: 1.0 | grad norm: 0.554 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:06.778421 | finish at 2025-09-10 11:49:06 + [2025-09-10 02:52:05] iteration 6200/ 11920 | consumed samples: 6348800 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.950645E+00 | loss scale: 1.0 | grad norm: 0.483 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:39.663496 | finish at 2025-09-10 11:48:44 + [2025-09-10 02:52:11] iteration 6201/ 11920 | consumed samples: 6349824 | elapsed time per iteration (ms): 5990.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948279E+00 | loss scale: 1.0 | grad norm: 0.424 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:57.891323 | finish at 2025-09-10 
12:23:09 + [2025-09-10 02:52:16] iteration 6202/ 11920 | consumed samples: 6350848 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944935E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:36.993520 | finish at 2025-09-10 11:48:53 + [2025-09-10 02:52:22] iteration 6203/ 11920 | consumed samples: 6351872 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947604E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:39.170579 | finish at 2025-09-10 11:48:01 + [2025-09-10 02:52:28] iteration 6204/ 11920 | consumed samples: 6352896 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943413E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:00.376987 | finish at 2025-09-10 11:48:28 + [2025-09-10 02:52:34] iteration 6205/ 11920 | consumed samples: 6353920 | elapsed time per iteration (ms): 6075.5 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930388E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:38:41.489378 | finish at 2025-09-10 12:31:15 + [2025-09-10 02:52:39] iteration 6206/ 11920 | consumed samples: 6354944 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927793E+00 | loss scale: 
1.0 | grad norm: 0.330 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:31.993010 | finish at 2025-09-10 11:48:11 + [2025-09-10 02:52:45] iteration 6207/ 11920 | consumed samples: 6355968 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.927736E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:29.930121 | finish at 2025-09-10 11:48:15 + [2025-09-10 02:52:51] iteration 6208/ 11920 | consumed samples: 6356992 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929900E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:48.639599 | finish at 2025-09-10 11:48:39 + [2025-09-10 02:52:56] iteration 6209/ 11920 | consumed samples: 6358016 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917547E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:33.931409 | finish at 2025-09-10 11:49:30 + [2025-09-10 02:53:02] iteration 6210/ 11920 | consumed samples: 6359040 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938299E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:36.333456 | finish at 2025-09-10 11:48:38 + [2025-09-10 02:53:08] iteration 6211/ 11920 | consumed samples: 6360064 | elapsed time per iteration 
(ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924111E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:33.585486 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:53:13] iteration 6212/ 11920 | consumed samples: 6361088 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918611E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:01.939813 | finish at 2025-09-10 11:49:15 + [2025-09-10 02:53:19] iteration 6213/ 11920 | consumed samples: 6362112 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.934352E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:58.506019 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:53:24] iteration 6214/ 11920 | consumed samples: 6363136 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909753E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:17.233257 | finish at 2025-09-10 11:47:42 + [2025-09-10 02:53:30] iteration 6215/ 11920 | consumed samples: 6364160 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916252E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 8:54:15.328381 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:53:36] iteration 6216/ 11920 | consumed samples: 6365184 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915910E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:11.793530 | finish at 2025-09-10 11:48:47 + [2025-09-10 02:53:41] iteration 6217/ 11920 | consumed samples: 6366208 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911740E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:08.163060 | finish at 2025-09-10 11:47:49 + [2025-09-10 02:53:47] iteration 6218/ 11920 | consumed samples: 6367232 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909403E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:29.921753 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:53:53] iteration 6219/ 11920 | consumed samples: 6368256 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917544E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:41.463068 | finish at 2025-09-10 11:48:34 + [2025-09-10 02:53:58] iteration 6220/ 11920 | consumed samples: 6369280 | elapsed time per iteration (ms): 5968.0 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.921566E+00 | loss scale: 1.0 | grad norm: 0.098 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:57.670298 | finish at 2025-09-10 12:20:56 + [2025-09-10 02:54:04] iteration 6221/ 11920 | consumed samples: 6370304 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909309E+00 | loss scale: 1.0 | grad norm: 0.105 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:56.431285 | finish at 2025-09-10 11:48:01 + [2025-09-10 02:54:10] iteration 6222/ 11920 | consumed samples: 6371328 | elapsed time per iteration (ms): 5985.6 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914963E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:28:26.022129 | finish at 2025-09-10 12:22:36 + [2025-09-10 02:54:16] iteration 6223/ 11920 | consumed samples: 6372352 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915302E+00 | loss scale: 1.0 | grad norm: 0.091 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:29.909768 | finish at 2025-09-10 11:49:46 + [2025-09-10 02:54:21] iteration 6224/ 11920 | consumed samples: 6373376 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907406E+00 | loss scale: 1.0 | grad norm: 0.079 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:48.778580 | finish at 2025-09-10 11:48:10 + [2025-09-10 02:54:27] iteration 6225/ 11920 | 
consumed samples: 6374400 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904938E+00 | loss scale: 1.0 | grad norm: 0.090 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:07.609866 | finish at 2025-09-10 11:47:35 + [2025-09-10 02:54:33] iteration 6226/ 11920 | consumed samples: 6375424 | elapsed time per iteration (ms): 5828.9 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908079E+00 | loss scale: 1.0 | grad norm: 0.088 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:13:09.850834 | finish at 2025-09-10 12:07:43 + [2025-09-10 02:54:38] iteration 6227/ 11920 | consumed samples: 6376448 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904969E+00 | loss scale: 1.0 | grad norm: 0.093 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:31.200970 | finish at 2025-09-10 11:48:10 + [2025-09-10 02:54:44] iteration 6228/ 11920 | consumed samples: 6377472 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901635E+00 | loss scale: 1.0 | grad norm: 0.106 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:05.463447 | finish at 2025-09-10 11:47:49 + [2025-09-10 02:54:50] iteration 6229/ 11920 | consumed samples: 6378496 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906752E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 4.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:41.341674 | finish at 2025-09-10 11:48:31 + [2025-09-10 02:54:55] iteration 6230/ 11920 | consumed samples: 6379520 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900719E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:27.020550 | finish at 2025-09-10 11:48:22 + [2025-09-10 02:55:01] iteration 6231/ 11920 | consumed samples: 6380544 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.909760E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:14.475249 | finish at 2025-09-10 11:48:15 + [2025-09-10 02:55:07] iteration 6232/ 11920 | consumed samples: 6381568 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894747E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:52:58.094547 | finish at 2025-09-10 11:48:05 + [2025-09-10 02:55:12] iteration 6233/ 11920 | consumed samples: 6382592 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894952E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:20.036335 | finish at 2025-09-10 11:48:32 + [2025-09-10 02:55:18] iteration 6234/ 11920 | consumed samples: 6383616 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898310E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:43.139709 | finish at 2025-09-10 11:49:01 + [2025-09-10 02:55:23] iteration 6235/ 11920 | consumed samples: 6384640 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904150E+00 | loss scale: 1.0 | grad norm: 0.336 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:52:53.958471 | finish at 2025-09-10 11:48:17 + [2025-09-10 02:55:29] iteration 6236/ 11920 | consumed samples: 6385664 | elapsed time per iteration (ms): 5945.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908572E+00 | loss scale: 1.0 | grad norm: 0.293 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:23:14.099449 | finish at 2025-09-10 12:18:43 + [2025-09-10 02:55:36] iteration 6237/ 11920 | consumed samples: 6386688 | elapsed time per iteration (ms): 6199.6 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906334E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:47:12.043513 | finish at 2025-09-10 12:42:48 + [2025-09-10 02:55:41] iteration 6238/ 11920 | consumed samples: 6387712 | elapsed time per iteration (ms): 5876.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901216E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:31.417099 | finish at 2025-09-10 
12:12:13 + [2025-09-10 02:55:47] iteration 6239/ 11920 | consumed samples: 6388736 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904015E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:57.971125 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:55:53] iteration 6240/ 11920 | consumed samples: 6389760 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898080E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:52:32.616348 | finish at 2025-09-10 11:48:25 + [2025-09-10 02:55:59] iteration 6241/ 11920 | consumed samples: 6390784 | elapsed time per iteration (ms): 5983.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899453E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:26:21.008715 | finish at 2025-09-10 12:22:20 + [2025-09-10 02:56:05] iteration 6242/ 11920 | consumed samples: 6391808 | elapsed time per iteration (ms): 5893.4 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899367E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:42.833237 | finish at 2025-09-10 12:13:47 + [2025-09-10 02:56:10] iteration 6243/ 11920 | consumed samples: 6392832 | elapsed time per iteration (ms): 5883.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902054E+00 | loss scale: 
1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:16:38.909762 | finish at 2025-09-10 12:12:49 + [2025-09-10 02:56:16] iteration 6244/ 11920 | consumed samples: 6393856 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890426E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:25.300492 | finish at 2025-09-10 11:49:41 + [2025-09-10 02:56:22] iteration 6245/ 11920 | consumed samples: 6394880 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891986E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:34.859141 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:56:27] iteration 6246/ 11920 | consumed samples: 6395904 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895207E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:28.718079 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:56:33] iteration 6247/ 11920 | consumed samples: 6396928 | elapsed time per iteration (ms): 5830.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890367E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:15.166639 | finish at 2025-09-10 12:07:48 + [2025-09-10 02:56:39] iteration 6248/ 11920 | consumed samples: 6397952 | elapsed time per iteration 
(ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902921E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:17.829384 | finish at 2025-09-10 11:47:57 + [2025-09-10 02:56:44] iteration 6249/ 11920 | consumed samples: 6398976 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896055E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:00.929265 | finish at 2025-09-10 11:49:45 + [2025-09-10 02:56:50] iteration 6250/ 11920 | consumed samples: 6400000 | elapsed time per iteration (ms): 5640.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879855E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:03.838878 | finish at 2025-09-10 11:49:54 + [2025-09-10 02:56:56] iteration 6251/ 11920 | consumed samples: 6401024 | elapsed time per iteration (ms): 5911.2 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899199E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:30.382215 | finish at 2025-09-10 12:15:26 + [2025-09-10 02:57:02] iteration 6252/ 11920 | consumed samples: 6402048 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884281E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 8:51:37.101407 | finish at 2025-09-10 11:48:39 + [2025-09-10 02:57:08] iteration 6253/ 11920 | consumed samples: 6403072 | elapsed time per iteration (ms): 6178.1 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882622E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:31.528413 | finish at 2025-09-10 12:40:39 + [2025-09-10 02:57:13] iteration 6254/ 11920 | consumed samples: 6404096 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893254E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:09.754576 | finish at 2025-09-10 11:48:23 + [2025-09-10 02:57:19] iteration 6255/ 11920 | consumed samples: 6405120 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910282E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:46.136597 | finish at 2025-09-10 11:48:05 + [2025-09-10 02:57:25] iteration 6256/ 11920 | consumed samples: 6406144 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902608E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:38.466476 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:57:30] iteration 6257/ 11920 | consumed samples: 6407168 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.894874E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:12.946555 | finish at 2025-09-10 11:47:43 + [2025-09-10 02:57:36] iteration 6258/ 11920 | consumed samples: 6408192 | elapsed time per iteration (ms): 5932.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889298E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:19:48.834306 | finish at 2025-09-10 12:17:25 + [2025-09-10 02:57:42] iteration 6259/ 11920 | consumed samples: 6409216 | elapsed time per iteration (ms): 6264.4 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895027E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:51:02.513339 | finish at 2025-09-10 12:48:45 + [2025-09-10 02:57:48] iteration 6260/ 11920 | consumed samples: 6410240 | elapsed time per iteration (ms): 5809.2 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894473E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:59.988608 | finish at 2025-09-10 12:05:48 + [2025-09-10 02:57:54] iteration 6261/ 11920 | consumed samples: 6411264 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885331E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:07.225671 | finish at 2025-09-10 11:49:01 + [2025-09-10 02:58:00] iteration 6262/ 11920 | 
consumed samples: 6412288 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891915E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:05.852190 | finish at 2025-09-10 11:48:05 + [2025-09-10 02:58:05] iteration 6263/ 11920 | consumed samples: 6413312 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892953E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:31.637412 | finish at 2025-09-10 11:48:37 + [2025-09-10 02:58:11] iteration 6264/ 11920 | consumed samples: 6414336 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880047E+00 | loss scale: 1.0 | grad norm: 0.097 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:26.273422 | finish at 2025-09-10 11:48:37 + [2025-09-10 02:58:16] iteration 6265/ 11920 | consumed samples: 6415360 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896835E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:30.585780 | finish at 2025-09-10 11:48:47 + [2025-09-10 02:58:22] iteration 6266/ 11920 | consumed samples: 6416384 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886279E+00 | loss scale: 1.0 | grad norm: 0.093 | num zeros: 1.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:24.395944 | finish at 2025-09-10 11:47:46 + [2025-09-10 02:58:28] iteration 6267/ 11920 | consumed samples: 6417408 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888237E+00 | loss scale: 1.0 | grad norm: 0.094 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:34.274681 | finish at 2025-09-10 11:48:02 + [2025-09-10 02:58:33] iteration 6268/ 11920 | consumed samples: 6418432 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893646E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:08.145661 | finish at 2025-09-10 11:47:41 + [2025-09-10 02:58:39] iteration 6269/ 11920 | consumed samples: 6419456 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901499E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:31.137149 | finish at 2025-09-10 11:48:10 + [2025-09-10 02:58:45] iteration 6270/ 11920 | consumed samples: 6420480 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883248E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:57.612798 | finish at 2025-09-10 11:48:42 + [2025-09-10 02:58:50] iteration 6271/ 11920 | consumed samples: 6421504 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888235E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:12.491907 | finish at 2025-09-10 11:48:03 + [2025-09-10 02:58:56] iteration 6272/ 11920 | consumed samples: 6422528 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891717E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:57.368057 | finish at 2025-09-10 11:48:53 + [2025-09-10 02:59:01] iteration 6273/ 11920 | consumed samples: 6423552 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904163E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:05.282415 | finish at 2025-09-10 11:48:07 + [2025-09-10 02:59:07] iteration 6274/ 11920 | consumed samples: 6424576 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886690E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:58.460064 | finish at 2025-09-10 11:48:05 + [2025-09-10 02:59:13] iteration 6275/ 11920 | consumed samples: 6425600 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897422E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:11.427854 | finish at 2025-09-10 
11:48:24 + [2025-09-10 02:59:18] iteration 6276/ 11920 | consumed samples: 6426624 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899146E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:13.295648 | finish at 2025-09-10 11:48:32 + [2025-09-10 02:59:24] iteration 6277/ 11920 | consumed samples: 6427648 | elapsed time per iteration (ms): 6201.6 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894032E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:43:15.782597 | finish at 2025-09-10 12:42:40 + [2025-09-10 02:59:30] iteration 6278/ 11920 | consumed samples: 6428672 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883724E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:36.624150 | finish at 2025-09-10 11:48:07 + [2025-09-10 02:59:36] iteration 6279/ 11920 | consumed samples: 6429696 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892531E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:20.322625 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:59:41] iteration 6280/ 11920 | consumed samples: 6430720 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875690E+00 | loss scale: 
1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:04.051781 | finish at 2025-09-10 11:47:45 + [2025-09-10 02:59:47] iteration 6281/ 11920 | consumed samples: 6431744 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899074E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:09.004047 | finish at 2025-09-10 11:47:56 + [2025-09-10 02:59:53] iteration 6282/ 11920 | consumed samples: 6432768 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886855E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:40.816480 | finish at 2025-09-10 11:48:33 + [2025-09-10 02:59:58] iteration 6283/ 11920 | consumed samples: 6433792 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891378E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:23.534014 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:00:04] iteration 6284/ 11920 | consumed samples: 6434816 | elapsed time per iteration (ms): 5938.6 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886236E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:49.999675 | finish at 2025-09-10 12:17:54 + [2025-09-10 03:00:10] iteration 6285/ 11920 | consumed samples: 6435840 | elapsed time per iteration 
(ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887593E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:41.562743 | finish at 2025-09-10 11:47:51 + [2025-09-10 03:00:15] iteration 6286/ 11920 | consumed samples: 6436864 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882899E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:12.887422 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:00:21] iteration 6287/ 11920 | consumed samples: 6437888 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894331E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:14.213561 | finish at 2025-09-10 11:48:35 + [2025-09-10 03:00:27] iteration 6288/ 11920 | consumed samples: 6438912 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895937E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:53.122314 | finish at 2025-09-10 11:48:20 + [2025-09-10 03:00:32] iteration 6289/ 11920 | consumed samples: 6439936 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889596E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 8:47:42.434494 | finish at 2025-09-10 11:48:15 + [2025-09-10 03:00:38] iteration 6290/ 11920 | consumed samples: 6440960 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893739E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:08.905268 | finish at 2025-09-10 11:47:47 + [2025-09-10 03:00:43] iteration 6291/ 11920 | consumed samples: 6441984 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878446E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:12.444208 | finish at 2025-09-10 11:47:56 + [2025-09-10 03:00:49] iteration 6292/ 11920 | consumed samples: 6443008 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892399E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:57.900215 | finish at 2025-09-10 11:47:47 + [2025-09-10 03:00:55] iteration 6293/ 11920 | consumed samples: 6444032 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873011E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:19.767228 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:01:00] iteration 6294/ 11920 | consumed samples: 6445056 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.892490E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:32.379937 | finish at 2025-09-10 11:48:33 + [2025-09-10 03:01:06] iteration 6295/ 11920 | consumed samples: 6446080 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879565E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:41.169705 | finish at 2025-09-10 11:47:47 + [2025-09-10 03:01:12] iteration 6296/ 11920 | consumed samples: 6447104 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870356E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:18.713123 | finish at 2025-09-10 11:47:30 + [2025-09-10 03:01:17] iteration 6297/ 11920 | consumed samples: 6448128 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891805E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:55.912208 | finish at 2025-09-10 11:49:13 + [2025-09-10 03:01:23] iteration 6298/ 11920 | consumed samples: 6449152 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885427E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:29.066088 | finish at 2025-09-10 11:47:52 + [2025-09-10 03:01:28] iteration 6299/ 11920 | 
consumed samples: 6450176 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879183E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:34.480719 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:01:34] iteration 6300/ 11920 | consumed samples: 6451200 | elapsed time per iteration (ms): 5951.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891868E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:28.819146 | finish at 2025-09-10 12:19:03 + [2025-09-10 03:01:40] iteration 6301/ 11920 | consumed samples: 6452224 | elapsed time per iteration (ms): 5614.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894354E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:47.739110 | finish at 2025-09-10 11:47:28 + [2025-09-10 03:01:46] iteration 6302/ 11920 | consumed samples: 6453248 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896089E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:47:02.171984 | finish at 2025-09-10 11:48:48 + [2025-09-10 03:01:51] iteration 6303/ 11920 | consumed samples: 6454272 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885639E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:17.449416 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:01:57] iteration 6304/ 11920 | consumed samples: 6455296 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884815E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:19.522648 | finish at 2025-09-10 11:48:16 + [2025-09-10 03:02:03] iteration 6305/ 11920 | consumed samples: 6456320 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880569E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:05.100105 | finish at 2025-09-10 11:48:08 + [2025-09-10 03:02:08] iteration 6306/ 11920 | consumed samples: 6457344 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897469E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:00.068808 | finish at 2025-09-10 11:48:08 + [2025-09-10 03:02:14] iteration 6307/ 11920 | consumed samples: 6458368 | elapsed time per iteration (ms): 5890.8 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881668E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:05.190205 | finish at 2025-09-10 12:13:19 + [2025-09-10 03:02:20] iteration 6308/ 11920 | consumed samples: 6459392 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910280E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:14.002705 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:02:25] iteration 6309/ 11920 | consumed samples: 6460416 | elapsed time per iteration (ms): 5615.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876245E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:07.423884 | finish at 2025-09-10 11:47:33 + [2025-09-10 03:02:31] iteration 6310/ 11920 | consumed samples: 6461440 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887029E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:16.791580 | finish at 2025-09-10 11:47:48 + [2025-09-10 03:02:37] iteration 6311/ 11920 | consumed samples: 6462464 | elapsed time per iteration (ms): 5841.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870728E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:06:04.430806 | finish at 2025-09-10 12:08:41 + [2025-09-10 03:02:42] iteration 6312/ 11920 | consumed samples: 6463488 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884902E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:23.275589 | finish at 2025-09-10 
11:48:06 + [2025-09-10 03:02:48] iteration 6313/ 11920 | consumed samples: 6464512 | elapsed time per iteration (ms): 5934.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885908E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:35.945851 | finish at 2025-09-10 12:17:24 + [2025-09-10 03:02:54] iteration 6314/ 11920 | consumed samples: 6465536 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891373E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:01.825917 | finish at 2025-09-10 11:47:56 + [2025-09-10 03:03:00] iteration 6315/ 11920 | consumed samples: 6466560 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879971E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:26.010916 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:03:05] iteration 6316/ 11920 | consumed samples: 6467584 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878773E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:17.680696 | finish at 2025-09-10 11:48:23 + [2025-09-10 03:03:11] iteration 6317/ 11920 | consumed samples: 6468608 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892817E+00 | loss scale: 
1.0 | grad norm: 0.128 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:36.576133 | finish at 2025-09-10 11:47:47 + [2025-09-10 03:03:16] iteration 6318/ 11920 | consumed samples: 6469632 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897470E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:45:37.818171 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:03:22] iteration 6319/ 11920 | consumed samples: 6470656 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877409E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:40.618628 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:03:28] iteration 6320/ 11920 | consumed samples: 6471680 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886080E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:58.970604 | finish at 2025-09-10 11:48:27 + [2025-09-10 03:03:33] iteration 6321/ 11920 | consumed samples: 6472704 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889770E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:27.584779 | finish at 2025-09-10 11:48:01 + [2025-09-10 03:03:39] iteration 6322/ 11920 | consumed samples: 6473728 | elapsed time per iteration 
(ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897185E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:22.671937 | finish at 2025-09-10 11:48:02 + [2025-09-10 03:03:45] iteration 6323/ 11920 | consumed samples: 6474752 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890559E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:29.826083 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:03:50] iteration 6324/ 11920 | consumed samples: 6475776 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891227E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:39.306494 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:03:56] iteration 6325/ 11920 | consumed samples: 6476800 | elapsed time per iteration (ms): 5837.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877011E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:17.931697 | finish at 2025-09-10 12:08:14 + [2025-09-10 03:04:02] iteration 6326/ 11920 | consumed samples: 6477824 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869711E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 8:44:44.499202 | finish at 2025-09-10 11:48:46 + [2025-09-10 03:04:08] iteration 6327/ 11920 | consumed samples: 6478848 | elapsed time per iteration (ms): 5945.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887532E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:14.475728 | finish at 2025-09-10 12:18:22 + [2025-09-10 03:04:13] iteration 6328/ 11920 | consumed samples: 6479872 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875909E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:55.530745 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:04:19] iteration 6329/ 11920 | consumed samples: 6480896 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891444E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:42.672379 | finish at 2025-09-10 11:48:01 + [2025-09-10 03:04:24] iteration 6330/ 11920 | consumed samples: 6481920 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887428E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:24.893386 | finish at 2025-09-10 11:47:49 + [2025-09-10 03:04:30] iteration 6331/ 11920 | consumed samples: 6482944 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.883957E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:24.712025 | finish at 2025-09-10 11:47:55 + [2025-09-10 03:04:36] iteration 6332/ 11920 | consumed samples: 6483968 | elapsed time per iteration (ms): 5957.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894428E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:14:51.337241 | finish at 2025-09-10 12:19:27 + [2025-09-10 03:04:42] iteration 6333/ 11920 | consumed samples: 6484992 | elapsed time per iteration (ms): 5965.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893810E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:31.419051 | finish at 2025-09-10 12:20:13 + [2025-09-10 03:04:48] iteration 6334/ 11920 | consumed samples: 6486016 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878639E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:53.419513 | finish at 2025-09-10 11:47:41 + [2025-09-10 03:04:53] iteration 6335/ 11920 | consumed samples: 6487040 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884915E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:57.899466 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:04:59] iteration 6336/ 11920 | 
consumed samples: 6488064 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874392E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:28.795147 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:05:04] iteration 6337/ 11920 | consumed samples: 6489088 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877864E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:48.222825 | finish at 2025-09-10 11:48:53 + [2025-09-10 03:05:10] iteration 6338/ 11920 | consumed samples: 6490112 | elapsed time per iteration (ms): 5825.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869905E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:01:57.190463 | finish at 2025-09-10 12:07:07 + [2025-09-10 03:05:16] iteration 6339/ 11920 | consumed samples: 6491136 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886484E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:11.820998 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:05:22] iteration 6340/ 11920 | consumed samples: 6492160 | elapsed time per iteration (ms): 5977.2 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869137E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 10.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:15:52.565975 | finish at 2025-09-10 12:21:14 + [2025-09-10 03:05:27] iteration 6341/ 11920 | consumed samples: 6493184 | elapsed time per iteration (ms): 5614.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868551E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:00.470544 | finish at 2025-09-10 11:47:28 + [2025-09-10 03:05:33] iteration 6342/ 11920 | consumed samples: 6494208 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873390E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:32.225378 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:05:39] iteration 6343/ 11920 | consumed samples: 6495232 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880509E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:33.561468 | finish at 2025-09-10 11:48:12 + [2025-09-10 03:05:45] iteration 6344/ 11920 | consumed samples: 6496256 | elapsed time per iteration (ms): 5943.9 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864063E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:12:23.010191 | finish at 2025-09-10 12:18:08 + [2025-09-10 03:05:50] iteration 6345/ 11920 | consumed samples: 6497280 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880467E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:06.912349 | finish at 2025-09-10 11:47:57 + [2025-09-10 03:05:56] iteration 6346/ 11920 | consumed samples: 6498304 | elapsed time per iteration (ms): 5920.6 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887858E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:10:01.626051 | finish at 2025-09-10 12:15:58 + [2025-09-10 03:06:02] iteration 6347/ 11920 | consumed samples: 6499328 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890934E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:40.105949 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:06:07] iteration 6348/ 11920 | consumed samples: 6500352 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888988E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:18.622201 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:06:13] iteration 6349/ 11920 | consumed samples: 6501376 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882996E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:20.519663 | finish at 2025-09-10 
11:48:34 + [2025-09-10 03:06:19] iteration 6350/ 11920 | consumed samples: 6502400 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885272E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:15.401301 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:06:24] iteration 6351/ 11920 | consumed samples: 6503424 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900105E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:20.573011 | finish at 2025-09-10 11:47:45 + [2025-09-10 03:06:30] iteration 6352/ 11920 | consumed samples: 6504448 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885856E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:15.015839 | finish at 2025-09-10 11:47:45 + [2025-09-10 03:06:36] iteration 6353/ 11920 | consumed samples: 6505472 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872308E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:43.374535 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:06:41] iteration 6354/ 11920 | consumed samples: 6506496 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871725E+00 | loss scale: 
1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:09.646864 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:06:47] iteration 6355/ 11920 | consumed samples: 6507520 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877270E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:53.045479 | finish at 2025-09-10 11:48:40 + [2025-09-10 03:06:52] iteration 6356/ 11920 | consumed samples: 6508544 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887357E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:48.889853 | finish at 2025-09-10 11:48:41 + [2025-09-10 03:06:58] iteration 6357/ 11920 | consumed samples: 6509568 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882268E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:51.659754 | finish at 2025-09-10 11:48:50 + [2025-09-10 03:07:04] iteration 6358/ 11920 | consumed samples: 6510592 | elapsed time per iteration (ms): 5951.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893567E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:42.871224 | finish at 2025-09-10 12:18:47 + [2025-09-10 03:07:10] iteration 6359/ 11920 | consumed samples: 6511616 | elapsed time per iteration 
(ms): 5834.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880900E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:00:47.833278 | finish at 2025-09-10 12:07:58 + [2025-09-10 03:07:16] iteration 6360/ 11920 | consumed samples: 6512640 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883519E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:25.964098 | finish at 2025-09-10 11:48:41 + [2025-09-10 03:07:21] iteration 6361/ 11920 | consumed samples: 6513664 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883590E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:25.776440 | finish at 2025-09-10 11:48:47 + [2025-09-10 03:07:27] iteration 6362/ 11920 | consumed samples: 6514688 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891175E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:54.256765 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:07:32] iteration 6363/ 11920 | consumed samples: 6515712 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882812E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 8:40:47.903458 | finish at 2025-09-10 11:48:20 + [2025-09-10 03:07:38] iteration 6364/ 11920 | consumed samples: 6516736 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878894E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:26.698397 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:07:44] iteration 6365/ 11920 | consumed samples: 6517760 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882696E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:27.194190 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:07:49] iteration 6366/ 11920 | consumed samples: 6518784 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878384E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:13.072842 | finish at 2025-09-10 11:48:02 + [2025-09-10 03:07:55] iteration 6367/ 11920 | consumed samples: 6519808 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876689E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:51.366627 | finish at 2025-09-10 11:48:46 + [2025-09-10 03:08:01] iteration 6368/ 11920 | consumed samples: 6520832 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.890213E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:34.222603 | finish at 2025-09-10 11:48:35 + [2025-09-10 03:08:06] iteration 6369/ 11920 | consumed samples: 6521856 | elapsed time per iteration (ms): 5957.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879056E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:08.709965 | finish at 2025-09-10 12:19:15 + [2025-09-10 03:08:12] iteration 6370/ 11920 | consumed samples: 6522880 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885639E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:47.934780 | finish at 2025-09-10 11:48:00 + [2025-09-10 03:08:18] iteration 6371/ 11920 | consumed samples: 6523904 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878289E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:41:43.636998 | finish at 2025-09-10 11:50:01 + [2025-09-10 03:08:23] iteration 6372/ 11920 | consumed samples: 6524928 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897175E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:16.496226 | finish at 2025-09-10 11:47:40 + [2025-09-10 03:08:29] iteration 6373/ 11920 | 
consumed samples: 6525952 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880913E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:24.737657 | finish at 2025-09-10 11:47:54 + [2025-09-10 03:08:35] iteration 6374/ 11920 | consumed samples: 6526976 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877968E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:11.764889 | finish at 2025-09-10 11:47:46 + [2025-09-10 03:08:40] iteration 6375/ 11920 | consumed samples: 6528000 | elapsed time per iteration (ms): 5905.9 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879757E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:05:48.309373 | finish at 2025-09-10 12:14:29 + [2025-09-10 03:08:46] iteration 6376/ 11920 | consumed samples: 6529024 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870087E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:17.214598 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:08:52] iteration 6377/ 11920 | consumed samples: 6530048 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879221E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 6.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:32.902028 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:08:57] iteration 6378/ 11920 | consumed samples: 6531072 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881279E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:12.978916 | finish at 2025-09-10 11:48:10 + [2025-09-10 03:09:03] iteration 6379/ 11920 | consumed samples: 6532096 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871281E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:39.557603 | finish at 2025-09-10 11:48:43 + [2025-09-10 03:09:09] iteration 6380/ 11920 | consumed samples: 6533120 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890910E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:08.538733 | finish at 2025-09-10 11:48:17 + [2025-09-10 03:09:14] iteration 6381/ 11920 | consumed samples: 6534144 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878273E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:03.836711 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:09:20] iteration 6382/ 11920 | consumed samples: 6535168 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863234E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:35.887957 | finish at 2025-09-10 11:48:56 + [2025-09-10 03:09:25] iteration 6383/ 11920 | consumed samples: 6536192 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868986E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:27.979033 | finish at 2025-09-10 11:47:53 + [2025-09-10 03:09:31] iteration 6384/ 11920 | consumed samples: 6537216 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887527E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:57.443382 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:09:37] iteration 6385/ 11920 | consumed samples: 6538240 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874882E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:53.823388 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:09:42] iteration 6386/ 11920 | consumed samples: 6539264 | elapsed time per iteration (ms): 5615.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881650E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:56.292960 | finish at 2025-09-10 
11:47:39 + [2025-09-10 03:09:48] iteration 6387/ 11920 | consumed samples: 6540288 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872385E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:22.060493 | finish at 2025-09-10 11:48:10 + [2025-09-10 03:09:54] iteration 6388/ 11920 | consumed samples: 6541312 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884268E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:31.030641 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:10:00] iteration 6389/ 11920 | consumed samples: 6542336 | elapsed time per iteration (ms): 5986.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865644E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:11:53.301765 | finish at 2025-09-10 12:21:53 + [2025-09-10 03:10:05] iteration 6390/ 11920 | consumed samples: 6543360 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880389E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:55.167429 | finish at 2025-09-10 11:48:00 + [2025-09-10 03:10:11] iteration 6391/ 11920 | consumed samples: 6544384 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890341E+00 | loss scale: 
1.0 | grad norm: 0.223 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:26.266966 | finish at 2025-09-10 11:48:37 + [2025-09-10 03:10:16] iteration 6392/ 11920 | consumed samples: 6545408 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893361E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:20.394110 | finish at 2025-09-10 11:49:37 + [2025-09-10 03:10:22] iteration 6393/ 11920 | consumed samples: 6546432 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872663E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:02.184096 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:10:28] iteration 6394/ 11920 | consumed samples: 6547456 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884434E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:19.248766 | finish at 2025-09-10 11:47:47 + [2025-09-10 03:10:33] iteration 6395/ 11920 | consumed samples: 6548480 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874454E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:13.633137 | finish at 2025-09-10 11:47:47 + [2025-09-10 03:10:39] iteration 6396/ 11920 | consumed samples: 6549504 | elapsed time per iteration 
(ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884411E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:26.260926 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:10:45] iteration 6397/ 11920 | consumed samples: 6550528 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890603E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:23.866802 | finish at 2025-09-10 11:48:08 + [2025-09-10 03:10:51] iteration 6398/ 11920 | consumed samples: 6551552 | elapsed time per iteration (ms): 5953.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863027E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:57.413964 | finish at 2025-09-10 12:18:48 + [2025-09-10 03:10:57] iteration 6399/ 11920 | consumed samples: 6552576 | elapsed time per iteration (ms): 6202.9 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871915E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:46.002553 | finish at 2025-09-10 12:41:43 + [2025-09-10 03:11:02] iteration 6400/ 11920 | consumed samples: 6553600 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883355E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 8:38:10.936375 | finish at 2025-09-10 11:49:13 + [2025-09-10 03:11:08] iteration 6401/ 11920 | consumed samples: 6554624 | elapsed time per iteration (ms): 5877.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873066E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:00:37.751758 | finish at 2025-09-10 12:11:46 + [2025-09-10 03:11:14] iteration 6402/ 11920 | consumed samples: 6555648 | elapsed time per iteration (ms): 6207.9 | throughput per GPU (TFLOP/s/GPU): 72.7 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881231E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:30:55.049061 | finish at 2025-09-10 12:42:09 + [2025-09-10 03:11:20] iteration 6403/ 11920 | consumed samples: 6556672 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889451E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:34.371934 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:11:26] iteration 6404/ 11920 | consumed samples: 6557696 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871886E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:30.822705 | finish at 2025-09-10 11:47:56 + [2025-09-10 03:11:31] iteration 6405/ 11920 | consumed samples: 6558720 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.883315E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:58.297216 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:11:37] iteration 6406/ 11920 | consumed samples: 6559744 | elapsed time per iteration (ms): 5854.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885127E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:59.043271 | finish at 2025-09-10 12:09:36 + [2025-09-10 03:11:43] iteration 6407/ 11920 | consumed samples: 6560768 | elapsed time per iteration (ms): 5820.0 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871718E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:45.926455 | finish at 2025-09-10 12:06:29 + [2025-09-10 03:11:49] iteration 6408/ 11920 | consumed samples: 6561792 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877537E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:45.373222 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:11:54] iteration 6409/ 11920 | consumed samples: 6562816 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875029E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:42.466665 | finish at 2025-09-10 11:48:37 + [2025-09-10 03:12:00] iteration 6410/ 11920 | 
consumed samples: 6563840 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858721E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:14.078860 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:12:05] iteration 6411/ 11920 | consumed samples: 6564864 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878062E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:30.217323 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:12:11] iteration 6412/ 11920 | consumed samples: 6565888 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865772E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:02.419713 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:12:17] iteration 6413/ 11920 | consumed samples: 6566912 | elapsed time per iteration (ms): 6107.8 | throughput per GPU (TFLOP/s/GPU): 73.9 | MFU 7.47% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881902E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:20:35.552226 | finish at 2025-09-10 12:32:53 + [2025-09-10 03:12:23] iteration 6414/ 11920 | consumed samples: 6567936 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876016E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:45.882753 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:12:28] iteration 6415/ 11920 | consumed samples: 6568960 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875153E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:00.851458 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:12:34] iteration 6416/ 11920 | consumed samples: 6569984 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884962E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:19.649445 | finish at 2025-09-10 11:47:54 + [2025-09-10 03:12:40] iteration 6417/ 11920 | consumed samples: 6571008 | elapsed time per iteration (ms): 5612.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890624E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:46.053004 | finish at 2025-09-10 11:47:26 + [2025-09-10 03:12:46] iteration 6418/ 11920 | consumed samples: 6572032 | elapsed time per iteration (ms): 5976.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869457E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:08:04.426867 | finish at 2025-09-10 12:20:50 + [2025-09-10 03:12:51] iteration 6419/ 11920 | consumed samples: 6573056 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886205E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:22.212485 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:12:57] iteration 6420/ 11920 | consumed samples: 6574080 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892273E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:15.759921 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:13:03] iteration 6421/ 11920 | consumed samples: 6575104 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869985E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:04.697959 | finish at 2025-09-10 11:48:07 + [2025-09-10 03:13:08] iteration 6422/ 11920 | consumed samples: 6576128 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873662E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:12.184844 | finish at 2025-09-10 11:48:20 + [2025-09-10 03:13:14] iteration 6423/ 11920 | consumed samples: 6577152 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875755E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:06.233445 | finish at 2025-09-10 
11:48:20 + [2025-09-10 03:13:19] iteration 6424/ 11920 | consumed samples: 6578176 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885900E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:25.896858 | finish at 2025-09-10 11:48:45 + [2025-09-10 03:13:25] iteration 6425/ 11920 | consumed samples: 6579200 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859865E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:37.731911 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:13:31] iteration 6426/ 11920 | consumed samples: 6580224 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876751E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:15.981582 | finish at 2025-09-10 11:48:47 + [2025-09-10 03:13:36] iteration 6427/ 11920 | consumed samples: 6581248 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884561E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:01.391225 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:13:42] iteration 6428/ 11920 | consumed samples: 6582272 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878762E+00 | loss scale: 
1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:52.299663 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:13:48] iteration 6429/ 11920 | consumed samples: 6583296 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872155E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:30.706918 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:13:53] iteration 6430/ 11920 | consumed samples: 6584320 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884886E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:11.265306 | finish at 2025-09-10 11:48:04 + [2025-09-10 03:13:59] iteration 6431/ 11920 | consumed samples: 6585344 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867090E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:58.047575 | finish at 2025-09-10 11:47:57 + [2025-09-10 03:14:05] iteration 6432/ 11920 | consumed samples: 6586368 | elapsed time per iteration (ms): 5955.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861672E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:45.575947 | finish at 2025-09-10 12:18:50 + [2025-09-10 03:14:10] iteration 6433/ 11920 | consumed samples: 6587392 | elapsed time per 
iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898818E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:11.511442 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:14:16] iteration 6434/ 11920 | consumed samples: 6588416 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892227E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:12.525398 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:14:22] iteration 6435/ 11920 | consumed samples: 6589440 | elapsed time per iteration (ms): 5959.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887283E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:49.372255 | finish at 2025-09-10 12:19:11 + [2025-09-10 03:14:28] iteration 6436/ 11920 | consumed samples: 6590464 | elapsed time per iteration (ms): 5832.0 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875018E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:02.512891 | finish at 2025-09-10 12:07:30 + [2025-09-10 03:14:33] iteration 6437/ 11920 | consumed samples: 6591488 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876648E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | remaining time: 8:33:36.681696 | finish at 2025-09-10 11:48:10 + [2025-09-10 03:14:39] iteration 6438/ 11920 | consumed samples: 6592512 | elapsed time per iteration (ms): 5828.3 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872415E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:52:30.592904 | finish at 2025-09-10 12:07:10 + [2025-09-10 03:14:45] iteration 6439/ 11920 | consumed samples: 6593536 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870462E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:18.509766 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:14:50] iteration 6440/ 11920 | consumed samples: 6594560 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892872E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:13.188515 | finish at 2025-09-10 11:48:04 + [2025-09-10 03:14:56] iteration 6441/ 11920 | consumed samples: 6595584 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885944E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:24.719672 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:15:02] iteration 6442/ 11920 | consumed samples: 6596608 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.885789E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:26.345963 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:15:07] iteration 6443/ 11920 | consumed samples: 6597632 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875720E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:19.001246 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:15:13] iteration 6444/ 11920 | consumed samples: 6598656 | elapsed time per iteration (ms): 5838.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889678E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:52:50.590800 | finish at 2025-09-10 12:08:04 + [2025-09-10 03:15:19] iteration 6445/ 11920 | consumed samples: 6599680 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884151E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:32:51.256363 | finish at 2025-09-10 11:48:10 + [2025-09-10 03:15:24] iteration 6446/ 11920 | consumed samples: 6600704 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868176E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:32:41.005536 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:15:30] iteration 6447/ 
11920 | consumed samples: 6601728 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867539E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:32:55.628430 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:15:36] iteration 6448/ 11920 | consumed samples: 6602752 | elapsed time per iteration (ms): 5853.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888012E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:50.055244 | finish at 2025-09-10 12:09:26 + [2025-09-10 03:15:41] iteration 6449/ 11920 | consumed samples: 6603776 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884151E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:32:11.461938 | finish at 2025-09-10 11:47:53 + [2025-09-10 03:15:47] iteration 6450/ 11920 | consumed samples: 6604800 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875317E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:17.329135 | finish at 2025-09-10 11:49:04 + [2025-09-10 03:15:53] iteration 6451/ 11920 | consumed samples: 6605824 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886111E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:32:45.865821 | finish at 2025-09-10 11:48:39 + [2025-09-10 03:15:59] iteration 6452/ 11920 | consumed samples: 6606848 | elapsed time per iteration (ms): 5974.3 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874096E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:04:27.590529 | finish at 2025-09-10 12:20:26 + [2025-09-10 03:16:04] iteration 6453/ 11920 | consumed samples: 6607872 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889216E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 12.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:27.403911 | finish at 2025-09-10 11:49:32 + [2025-09-10 03:16:10] iteration 6454/ 11920 | consumed samples: 6608896 | elapsed time per iteration (ms): 5854.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.899368E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:53:18.371098 | finish at 2025-09-10 12:09:29 + [2025-09-10 03:16:16] iteration 6455/ 11920 | consumed samples: 6609920 | elapsed time per iteration (ms): 5890.5 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878195E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:31.842164 | finish at 2025-09-10 12:12:48 + [2025-09-10 03:16:22] iteration 6456/ 11920 | consumed samples: 6610944 | elapsed time per iteration (ms): 5955.9 | throughput per GPU 
(TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877365E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:22.940947 | finish at 2025-09-10 12:18:45 + [2025-09-10 03:16:28] iteration 6457/ 11920 | consumed samples: 6611968 | elapsed time per iteration (ms): 5922.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880708E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:59:17.052566 | finish at 2025-09-10 12:15:45 + [2025-09-10 03:16:34] iteration 6458/ 11920 | consumed samples: 6612992 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875540E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:48.871882 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:16:39] iteration 6459/ 11920 | consumed samples: 6614016 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892872E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:05.938081 | finish at 2025-09-10 11:47:45 + [2025-09-10 03:16:45] iteration 6460/ 11920 | consumed samples: 6615040 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878271E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:13.556385 | finish 
at 2025-09-10 11:47:58 + [2025-09-10 03:16:50] iteration 6461/ 11920 | consumed samples: 6616064 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883329E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:36.280570 | finish at 2025-09-10 11:48:27 + [2025-09-10 03:16:56] iteration 6462/ 11920 | consumed samples: 6617088 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862301E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:41.715861 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:17:02] iteration 6463/ 11920 | consumed samples: 6618112 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855538E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:02.917899 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:17:08] iteration 6464/ 11920 | consumed samples: 6619136 | elapsed time per iteration (ms): 5962.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888366E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:02:12.079544 | finish at 2025-09-10 12:19:20 + [2025-09-10 03:17:13] iteration 6465/ 11920 | consumed samples: 6620160 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863576E+00 
| loss scale: 1.0 | grad norm: 0.143 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:16.745837 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:17:19] iteration 6466/ 11920 | consumed samples: 6621184 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873536E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:41.937572 | finish at 2025-09-10 11:49:01 + [2025-09-10 03:17:25] iteration 6467/ 11920 | consumed samples: 6622208 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858229E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:00.936595 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:17:30] iteration 6468/ 11920 | consumed samples: 6623232 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880419E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:52.927291 | finish at 2025-09-10 11:48:23 + [2025-09-10 03:17:36] iteration 6469/ 11920 | consumed samples: 6624256 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886586E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:06.741235 | finish at 2025-09-10 11:47:42 + [2025-09-10 03:17:41] iteration 6470/ 11920 | consumed samples: 6625280 | elapsed time 
per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882675E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:31.496787 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:17:47] iteration 6471/ 11920 | consumed samples: 6626304 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870712E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:30.896217 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:17:53] iteration 6472/ 11920 | consumed samples: 6627328 | elapsed time per iteration (ms): 5843.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887949E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:33.907997 | finish at 2025-09-10 12:08:27 + [2025-09-10 03:17:58] iteration 6473/ 11920 | consumed samples: 6628352 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872541E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:08.444672 | finish at 2025-09-10 11:48:07 + [2025-09-10 03:18:04] iteration 6474/ 11920 | consumed samples: 6629376 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885358E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 8:30:54.134015 | finish at 2025-09-10 11:48:58 + [2025-09-10 03:18:10] iteration 6475/ 11920 | consumed samples: 6630400 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861089E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:59.651817 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:18:15] iteration 6476/ 11920 | consumed samples: 6631424 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890981E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:47.079293 | finish at 2025-09-10 11:49:02 + [2025-09-10 03:18:21] iteration 6477/ 11920 | consumed samples: 6632448 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890466E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:49.569835 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:18:27] iteration 6478/ 11920 | consumed samples: 6633472 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874211E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:46.670654 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:18:32] iteration 6479/ 11920 | consumed samples: 6634496 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.880653E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:24.542849 | finish at 2025-09-10 11:47:57 + [2025-09-10 03:18:38] iteration 6480/ 11920 | consumed samples: 6635520 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872012E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:30.140533 | finish at 2025-09-10 11:48:08 + [2025-09-10 03:18:43] iteration 6481/ 11920 | consumed samples: 6636544 | elapsed time per iteration (ms): 5614.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871040E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:55.166297 | finish at 2025-09-10 11:47:39 + [2025-09-10 03:18:49] iteration 6482/ 11920 | consumed samples: 6637568 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867646E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:12.180350 | finish at 2025-09-10 11:48:01 + [2025-09-10 03:18:55] iteration 6483/ 11920 | consumed samples: 6638592 | elapsed time per iteration (ms): 6337.5 | throughput per GPU (TFLOP/s/GPU): 71.2 | MFU 7.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882854E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:34:16.911538 | finish at 2025-09-10 12:53:12 + [2025-09-10 03:19:01] 
iteration 6484/ 11920 | consumed samples: 6639616 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867012E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:08.763159 | finish at 2025-09-10 11:49:10 + [2025-09-10 03:19:07] iteration 6485/ 11920 | consumed samples: 6640640 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872608E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:17.873485 | finish at 2025-09-10 11:49:25 + [2025-09-10 03:19:12] iteration 6486/ 11920 | consumed samples: 6641664 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894952E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:09.408930 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:19:18] iteration 6487/ 11920 | consumed samples: 6642688 | elapsed time per iteration (ms): 5942.6 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890464E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:06.157522 | finish at 2025-09-10 12:17:24 + [2025-09-10 03:19:24] iteration 6488/ 11920 | consumed samples: 6643712 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879244E+00 | loss scale: 1.0 | grad norm: 0.229 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:39.140261 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:19:29] iteration 6489/ 11920 | consumed samples: 6644736 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864992E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:30.816918 | finish at 2025-09-10 11:48:00 + [2025-09-10 03:19:35] iteration 6490/ 11920 | consumed samples: 6645760 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867922E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:15.761290 | finish at 2025-09-10 11:47:51 + [2025-09-10 03:19:41] iteration 6491/ 11920 | consumed samples: 6646784 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879832E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:34.595861 | finish at 2025-09-10 11:48:15 + [2025-09-10 03:19:46] iteration 6492/ 11920 | consumed samples: 6647808 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870623E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:31.056166 | finish at 2025-09-10 11:48:17 + [2025-09-10 03:19:52] iteration 6493/ 11920 | consumed samples: 6648832 | elapsed time per iteration (ms): 5620.9 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880228E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:24.614785 | finish at 2025-09-10 11:48:17 + [2025-09-10 03:19:58] iteration 6494/ 11920 | consumed samples: 6649856 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867751E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:19.970600 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:20:03] iteration 6495/ 11920 | consumed samples: 6650880 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881030E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:28.798325 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:20:09] iteration 6496/ 11920 | consumed samples: 6651904 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873896E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:58.341167 | finish at 2025-09-10 11:50:07 + [2025-09-10 03:20:14] iteration 6497/ 11920 | consumed samples: 6652928 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877369E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:23.755680 
| finish at 2025-09-10 11:48:38 + [2025-09-10 03:20:20] iteration 6498/ 11920 | consumed samples: 6653952 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864003E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:23.893310 | finish at 2025-09-10 11:49:44 + [2025-09-10 03:20:26] iteration 6499/ 11920 | consumed samples: 6654976 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874199E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:06.851365 | finish at 2025-09-10 11:48:33 + [2025-09-10 03:20:31] iteration 6500/ 11920 | consumed samples: 6656000 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882501E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:28:08.771553 | finish at 2025-09-10 11:48:40 + [2025-09-10 03:20:37] iteration 6501/ 11920 | consumed samples: 6657024 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869305E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:35.913747 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:20:43] iteration 6502/ 11920 | consumed samples: 6658048 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.873626E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:47.135399 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:20:48] iteration 6503/ 11920 | consumed samples: 6659072 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876569E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:08.047676 | finish at 2025-09-10 11:47:56 + [2025-09-10 03:20:54] iteration 6504/ 11920 | consumed samples: 6660096 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876522E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:37.336283 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:20:59] iteration 6505/ 11920 | consumed samples: 6661120 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880011E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:11.970166 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:21:05] iteration 6506/ 11920 | consumed samples: 6662144 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890721E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:05.911355 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:21:11] iteration 6507/ 11920 | consumed samples: 6663168 | 
elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877925E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:14.327626 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:21:17] iteration 6508/ 11920 | consumed samples: 6664192 | elapsed time per iteration (ms): 5926.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873795E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:54:34.326831 | finish at 2025-09-10 12:15:51 + [2025-09-10 03:21:22] iteration 6509/ 11920 | consumed samples: 6665216 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882606E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:26:50.790813 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:21:28] iteration 6510/ 11920 | consumed samples: 6666240 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873465E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:26:52.783296 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:21:33] iteration 6511/ 11920 | consumed samples: 6667264 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892530E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:26:57.185818 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:21:39] iteration 6512/ 11920 | consumed samples: 6668288 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881019E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:26:58.726105 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:21:45] iteration 6513/ 11920 | consumed samples: 6669312 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863643E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:26:09.421774 | finish at 2025-09-10 11:47:54 + [2025-09-10 03:21:50] iteration 6514/ 11920 | consumed samples: 6670336 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876414E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:26:10.665854 | finish at 2025-09-10 11:48:01 + [2025-09-10 03:21:56] iteration 6515/ 11920 | consumed samples: 6671360 | elapsed time per iteration (ms): 5966.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878847E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:28.609285 | finish at 2025-09-10 12:19:25 + [2025-09-10 03:22:02] iteration 6516/ 11920 | consumed samples: 6672384 | elapsed time per iteration (ms): 6190.6 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.871909E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:17:34.188779 | finish at 2025-09-10 12:39:37 + [2025-09-10 03:22:09] iteration 6517/ 11920 | consumed samples: 6673408 | elapsed time per iteration (ms): 6102.6 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882581E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:09:32.593081 | finish at 2025-09-10 12:31:41 + [2025-09-10 03:22:14] iteration 6518/ 11920 | consumed samples: 6674432 | elapsed time per iteration (ms): 5881.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861681E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:30.029008 | finish at 2025-09-10 12:11:45 + [2025-09-10 03:22:20] iteration 6519/ 11920 | consumed samples: 6675456 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882849E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:26:30.546719 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:22:26] iteration 6520/ 11920 | consumed samples: 6676480 | elapsed time per iteration (ms): 5816.2 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876168E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:27.635880 | finish at 2025-09-10 12:05:54 + [2025-09-10 03:22:32] 
iteration 6521/ 11920 | consumed samples: 6677504 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879437E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:44.919071 | finish at 2025-09-10 11:48:16 + [2025-09-10 03:22:37] iteration 6522/ 11920 | consumed samples: 6678528 | elapsed time per iteration (ms): 5822.7 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873322E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:50.793159 | finish at 2025-09-10 12:06:28 + [2025-09-10 03:22:43] iteration 6523/ 11920 | consumed samples: 6679552 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872720E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:33.678130 | finish at 2025-09-10 11:48:17 + [2025-09-10 03:22:49] iteration 6524/ 11920 | consumed samples: 6680576 | elapsed time per iteration (ms): 5815.8 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882524E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:02.261093 | finish at 2025-09-10 12:05:51 + [2025-09-10 03:22:54] iteration 6525/ 11920 | consumed samples: 6681600 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885255E+00 | loss scale: 1.0 | grad norm: 0.227 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:45.960463 | finish at 2025-09-10 11:48:40 + [2025-09-10 03:23:00] iteration 6526/ 11920 | consumed samples: 6682624 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888265E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:09.203423 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:23:06] iteration 6527/ 11920 | consumed samples: 6683648 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873807E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:25.984136 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:23:11] iteration 6528/ 11920 | consumed samples: 6684672 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872326E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:24:46.215351 | finish at 2025-09-10 11:47:57 + [2025-09-10 03:23:17] iteration 6529/ 11920 | consumed samples: 6685696 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881200E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:21.337802 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:23:23] iteration 6530/ 11920 | consumed samples: 6686720 | elapsed time per iteration (ms): 5905.6 | throughput 
per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884695E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:31.267912 | finish at 2025-09-10 12:13:54 + [2025-09-10 03:23:29] iteration 6531/ 11920 | consumed samples: 6687744 | elapsed time per iteration (ms): 5995.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891788E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:58:28.205598 | finish at 2025-09-10 12:21:57 + [2025-09-10 03:23:34] iteration 6532/ 11920 | consumed samples: 6688768 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870749E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:29.934242 | finish at 2025-09-10 11:49:04 + [2025-09-10 03:23:40] iteration 6533/ 11920 | consumed samples: 6689792 | elapsed time per iteration (ms): 5613.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868314E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:24:00.765109 | finish at 2025-09-10 11:47:41 + [2025-09-10 03:23:46] iteration 6534/ 11920 | consumed samples: 6690816 | elapsed time per iteration (ms): 5988.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866696E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:57:34.312041 
| finish at 2025-09-10 12:21:20 + [2025-09-10 03:23:52] iteration 6535/ 11920 | consumed samples: 6691840 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888967E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:24:59.775242 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:23:57] iteration 6536/ 11920 | consumed samples: 6692864 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871595E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:08.181356 | finish at 2025-09-10 11:49:05 + [2025-09-10 03:24:03] iteration 6537/ 11920 | consumed samples: 6693888 | elapsed time per iteration (ms): 5914.4 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878255E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:37.301730 | finish at 2025-09-10 12:14:41 + [2025-09-10 03:24:09] iteration 6538/ 11920 | consumed samples: 6694912 | elapsed time per iteration (ms): 5978.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872262E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:56:15.977161 | finish at 2025-09-10 12:20:25 + [2025-09-10 03:24:15] iteration 6539/ 11920 | consumed samples: 6695936 | elapsed time per iteration (ms): 5613.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.865601E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:24.491653 | finish at 2025-09-10 11:47:39 + [2025-09-10 03:24:20] iteration 6540/ 11920 | consumed samples: 6696960 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873022E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:10.574012 | finish at 2025-09-10 11:49:31 + [2025-09-10 03:24:26] iteration 6541/ 11920 | consumed samples: 6697984 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885181E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:24:02.638483 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:24:32] iteration 6542/ 11920 | consumed samples: 6699008 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874223E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:40.906379 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:24:37] iteration 6543/ 11920 | consumed samples: 6700032 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867208E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:49.158009 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:24:43] iteration 6544/ 11920 | consumed samples: 6701056 | 
elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884758E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:10.500549 | finish at 2025-09-10 11:47:53 + [2025-09-10 03:24:49] iteration 6545/ 11920 | consumed samples: 6702080 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862593E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:44.408776 | finish at 2025-09-10 11:48:33 + [2025-09-10 03:24:54] iteration 6546/ 11920 | consumed samples: 6703104 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873374E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:50.117106 | finish at 2025-09-10 11:48:44 + [2025-09-10 03:25:00] iteration 6547/ 11920 | consumed samples: 6704128 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855004E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:21.147769 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:25:06] iteration 6548/ 11920 | consumed samples: 6705152 | elapsed time per iteration (ms): 5942.6 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873229E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:52:03.447461 | finish at 2025-09-10 12:17:09 + [2025-09-10 03:25:11] iteration 6549/ 11920 | consumed samples: 6706176 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871207E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:32.502468 | finish at 2025-09-10 11:48:44 + [2025-09-10 03:25:17] iteration 6550/ 11920 | consumed samples: 6707200 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865867E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:49.455578 | finish at 2025-09-10 11:49:06 + [2025-09-10 03:25:23] iteration 6551/ 11920 | consumed samples: 6708224 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875872E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:48.574034 | finish at 2025-09-10 11:49:11 + [2025-09-10 03:25:29] iteration 6552/ 11920 | consumed samples: 6709248 | elapsed time per iteration (ms): 5985.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867983E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:29.131699 | finish at 2025-09-10 12:20:58 + [2025-09-10 03:25:34] iteration 6553/ 11920 | consumed samples: 6710272 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.860341E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:50.356416 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:25:40] iteration 6554/ 11920 | consumed samples: 6711296 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881318E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:17.292815 | finish at 2025-09-10 11:47:57 + [2025-09-10 03:25:45] iteration 6555/ 11920 | consumed samples: 6712320 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869024E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:27.184471 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:25:51] iteration 6556/ 11920 | consumed samples: 6713344 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886702E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:08.566535 | finish at 2025-09-10 11:49:00 + [2025-09-10 03:25:57] iteration 6557/ 11920 | consumed samples: 6714368 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865769E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:37.173967 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:26:02] 
iteration 6558/ 11920 | consumed samples: 6715392 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872121E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:28.482615 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:26:08] iteration 6559/ 11920 | consumed samples: 6716416 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868015E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:21.781227 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:26:14] iteration 6560/ 11920 | consumed samples: 6717440 | elapsed time per iteration (ms): 5948.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873657E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:24.116745 | finish at 2025-09-10 12:17:38 + [2025-09-10 03:26:20] iteration 6561/ 11920 | consumed samples: 6718464 | elapsed time per iteration (ms): 5855.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870848E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:58.414440 | finish at 2025-09-10 12:09:18 + [2025-09-10 03:26:25] iteration 6562/ 11920 | consumed samples: 6719488 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889745E+00 | loss scale: 1.0 | grad norm: 0.245 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:54.225577 | finish at 2025-09-10 11:48:20 + [2025-09-10 03:26:31] iteration 6563/ 11920 | consumed samples: 6720512 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863805E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:40.014651 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:26:37] iteration 6564/ 11920 | consumed samples: 6721536 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877261E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:13.148330 | finish at 2025-09-10 11:47:50 + [2025-09-10 03:26:42] iteration 6565/ 11920 | consumed samples: 6722560 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878425E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:41.877555 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:26:48] iteration 6566/ 11920 | consumed samples: 6723584 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884497E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:45.624472 | finish at 2025-09-10 11:48:33 + [2025-09-10 03:26:54] iteration 6567/ 11920 | consumed samples: 6724608 | elapsed time per iteration (ms): 5901.6 | throughput 
per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875836E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:46:31.185051 | finish at 2025-09-10 12:13:25 + [2025-09-10 03:26:59] iteration 6568/ 11920 | consumed samples: 6725632 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865646E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:36.578293 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:27:05] iteration 6569/ 11920 | consumed samples: 6726656 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862078E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:30.464968 | finish at 2025-09-10 11:48:35 + [2025-09-10 03:27:11] iteration 6570/ 11920 | consumed samples: 6727680 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870003E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:30.094304 | finish at 2025-09-10 11:48:41 + [2025-09-10 03:27:16] iteration 6571/ 11920 | consumed samples: 6728704 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883558E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:01.303230 
| finish at 2025-09-10 11:49:18 + [2025-09-10 03:27:22] iteration 6572/ 11920 | consumed samples: 6729728 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863631E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:45.072433 | finish at 2025-09-10 11:49:07 + [2025-09-10 03:27:28] iteration 6573/ 11920 | consumed samples: 6730752 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873099E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:09.990950 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:27:33] iteration 6574/ 11920 | consumed samples: 6731776 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863513E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:28.070859 | finish at 2025-09-10 11:48:01 + [2025-09-10 03:27:39] iteration 6575/ 11920 | consumed samples: 6732800 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877790E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:20.026305 | finish at 2025-09-10 11:47:59 + [2025-09-10 03:27:45] iteration 6576/ 11920 | consumed samples: 6733824 | elapsed time per iteration (ms): 5929.5 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.875016E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:07.147896 | finish at 2025-09-10 12:15:52 + [2025-09-10 03:27:51] iteration 6577/ 11920 | consumed samples: 6734848 | elapsed time per iteration (ms): 5874.9 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874537E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:43:09.525007 | finish at 2025-09-10 12:11:00 + [2025-09-10 03:27:56] iteration 6578/ 11920 | consumed samples: 6735872 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873445E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:17.734512 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:28:02] iteration 6579/ 11920 | consumed samples: 6736896 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871426E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:41.451761 | finish at 2025-09-10 11:48:43 + [2025-09-10 03:28:07] iteration 6580/ 11920 | consumed samples: 6737920 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870998E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:20.428262 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:28:13] iteration 6581/ 11920 | consumed samples: 6738944 | 
elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875659E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:25.562606 | finish at 2025-09-10 11:48:39 + [2025-09-10 03:28:19] iteration 6582/ 11920 | consumed samples: 6739968 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879328E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:34.937303 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:28:24] iteration 6583/ 11920 | consumed samples: 6740992 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868797E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:49.641087 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:28:30] iteration 6584/ 11920 | consumed samples: 6742016 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874684E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:59.640711 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:28:36] iteration 6585/ 11920 | consumed samples: 6743040 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885695E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:20:03.588840 | finish at 2025-09-10 11:48:39 + [2025-09-10 03:28:41] iteration 6586/ 11920 | consumed samples: 6744064 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867121E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:56.446485 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:28:47] iteration 6587/ 11920 | consumed samples: 6745088 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864972E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:14.347727 | finish at 2025-09-10 11:48:01 + [2025-09-10 03:28:52] iteration 6588/ 11920 | consumed samples: 6746112 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874639E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:10.125495 | finish at 2025-09-10 11:48:03 + [2025-09-10 03:28:58] iteration 6589/ 11920 | consumed samples: 6747136 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868666E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:46.473361 | finish at 2025-09-10 11:48:44 + [2025-09-10 03:29:04] iteration 6590/ 11920 | consumed samples: 6748160 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.861991E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:20.837605 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:29:09] iteration 6591/ 11920 | consumed samples: 6749184 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871508E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:48.896984 | finish at 2025-09-10 11:48:58 + [2025-09-10 03:29:15] iteration 6592/ 11920 | consumed samples: 6750208 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864519E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:58.341534 | finish at 2025-09-10 11:49:13 + [2025-09-10 03:29:21] iteration 6593/ 11920 | consumed samples: 6751232 | elapsed time per iteration (ms): 5959.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869509E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:49:06.736977 | finish at 2025-09-10 12:18:28 + [2025-09-10 03:29:27] iteration 6594/ 11920 | consumed samples: 6752256 | elapsed time per iteration (ms): 6294.4 | throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877043E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:18:44.065703 | finish at 2025-09-10 12:48:11 + [2025-09-10 03:29:33] 
iteration 6595/ 11920 | consumed samples: 6753280 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885205E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:03.460965 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:29:39] iteration 6596/ 11920 | consumed samples: 6754304 | elapsed time per iteration (ms): 6322.5 | throughput per GPU (TFLOP/s/GPU): 71.4 | MFU 7.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880644E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:21:00.868093 | finish at 2025-09-10 12:50:40 + [2025-09-10 03:29:45] iteration 6597/ 11920 | consumed samples: 6755328 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884364E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:41.387885 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:29:51] iteration 6598/ 11920 | consumed samples: 6756352 | elapsed time per iteration (ms): 5829.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882838E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:04.923616 | finish at 2025-09-10 12:06:55 + [2025-09-10 03:29:56] iteration 6599/ 11920 | consumed samples: 6757376 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880663E+00 | loss scale: 1.0 | grad norm: 0.212 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:32.897231 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:30:02] iteration 6600/ 11920 | consumed samples: 6758400 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878193E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:14.547300 | finish at 2025-09-10 11:48:16 + [2025-09-10 03:30:07] iteration 6601/ 11920 | consumed samples: 6759424 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883862E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:11.421204 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:30:13] iteration 6602/ 11920 | consumed samples: 6760448 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861658E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:16.037297 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:30:19] iteration 6603/ 11920 | consumed samples: 6761472 | elapsed time per iteration (ms): 5997.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874747E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:51:27.169510 | finish at 2025-09-10 12:21:46 + [2025-09-10 03:30:25] iteration 6604/ 11920 | consumed samples: 6762496 | elapsed time per iteration (ms): 5617.9 | throughput 
per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871510E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:44.626562 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:30:30] iteration 6605/ 11920 | consumed samples: 6763520 | elapsed time per iteration (ms): 5612.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865032E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:08.283015 | finish at 2025-09-10 11:47:39 + [2025-09-10 03:30:36] iteration 6606/ 11920 | consumed samples: 6764544 | elapsed time per iteration (ms): 6158.0 | throughput per GPU (TFLOP/s/GPU): 73.3 | MFU 7.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872356E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:05:23.463726 | finish at 2025-09-10 12:36:00 + [2025-09-10 03:30:42] iteration 6607/ 11920 | consumed samples: 6765568 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870303E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:43.877987 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:30:48] iteration 6608/ 11920 | consumed samples: 6766592 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860154E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:19.327011 
| finish at 2025-09-10 11:48:07 + [2025-09-10 03:30:53] iteration 6609/ 11920 | consumed samples: 6767616 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871041E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:39.309264 | finish at 2025-09-10 11:48:33 + [2025-09-10 03:30:59] iteration 6610/ 11920 | consumed samples: 6768640 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869698E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:47.808094 | finish at 2025-09-10 11:48:47 + [2025-09-10 03:31:05] iteration 6611/ 11920 | consumed samples: 6769664 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880482E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:54.979126 | finish at 2025-09-10 11:48:00 + [2025-09-10 03:31:10] iteration 6612/ 11920 | consumed samples: 6770688 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863801E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:03.014421 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:31:16] iteration 6613/ 11920 | consumed samples: 6771712 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.870867E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:12.885567 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:31:21] iteration 6614/ 11920 | consumed samples: 6772736 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864283E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:04.851705 | finish at 2025-09-10 11:49:26 + [2025-09-10 03:31:27] iteration 6615/ 11920 | consumed samples: 6773760 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883492E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:28.588247 | finish at 2025-09-10 11:48:56 + [2025-09-10 03:31:33] iteration 6616/ 11920 | consumed samples: 6774784 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866665E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:41.765779 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:31:38] iteration 6617/ 11920 | consumed samples: 6775808 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881507E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:16.323557 | finish at 2025-09-10 11:47:55 + [2025-09-10 03:31:44] iteration 6618/ 11920 | consumed samples: 6776832 | 
elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869519E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:27.193628 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:31:49] iteration 6619/ 11920 | consumed samples: 6777856 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865315E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:29.395005 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:31:55] iteration 6620/ 11920 | consumed samples: 6778880 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874880E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:48.848143 | finish at 2025-09-10 11:48:44 + [2025-09-10 03:32:01] iteration 6621/ 11920 | consumed samples: 6779904 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861829E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:54.379478 | finish at 2025-09-10 11:48:55 + [2025-09-10 03:32:06] iteration 6622/ 11920 | consumed samples: 6780928 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869999E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:15:59.037070 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:32:12] iteration 6623/ 11920 | consumed samples: 6781952 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883486E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:14.776994 | finish at 2025-09-10 11:48:27 + [2025-09-10 03:32:18] iteration 6624/ 11920 | consumed samples: 6782976 | elapsed time per iteration (ms): 5643.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876765E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:08.260574 | finish at 2025-09-10 11:50:26 + [2025-09-10 03:32:23] iteration 6625/ 11920 | consumed samples: 6784000 | elapsed time per iteration (ms): 5834.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873360E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:55.276126 | finish at 2025-09-10 12:07:19 + [2025-09-10 03:32:29] iteration 6626/ 11920 | consumed samples: 6785024 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864684E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:45.015502 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:32:35] iteration 6627/ 11920 | consumed samples: 6786048 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.889678E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:38.579131 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:32:40] iteration 6628/ 11920 | consumed samples: 6787072 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891402E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:24.680048 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:32:46] iteration 6629/ 11920 | consumed samples: 6788096 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877579E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:23.084141 | finish at 2025-09-10 11:49:09 + [2025-09-10 03:32:52] iteration 6630/ 11920 | consumed samples: 6789120 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870129E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:27.636378 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:32:57] iteration 6631/ 11920 | consumed samples: 6790144 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855041E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:29.331268 | finish at 2025-09-10 11:49:27 + [2025-09-10 03:33:03] 
iteration 6632/ 11920 | consumed samples: 6791168 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869769E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:25.517515 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:33:09] iteration 6633/ 11920 | consumed samples: 6792192 | elapsed time per iteration (ms): 5844.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871953E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:00.247478 | finish at 2025-09-10 12:08:09 + [2025-09-10 03:33:14] iteration 6634/ 11920 | consumed samples: 6793216 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872898E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:54.477835 | finish at 2025-09-10 11:49:09 + [2025-09-10 03:33:20] iteration 6635/ 11920 | consumed samples: 6794240 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856580E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:01.173387 | finish at 2025-09-10 11:49:21 + [2025-09-10 03:33:26] iteration 6636/ 11920 | consumed samples: 6795264 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858429E+00 | loss scale: 1.0 | grad norm: 0.152 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:59.780113 | finish at 2025-09-10 11:49:25 + [2025-09-10 03:33:31] iteration 6637/ 11920 | consumed samples: 6796288 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868177E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:50.740278 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:33:37] iteration 6638/ 11920 | consumed samples: 6797312 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868949E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:37.967248 | finish at 2025-09-10 11:48:15 + [2025-09-10 03:33:42] iteration 6639/ 11920 | consumed samples: 6798336 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863974E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:29.388433 | finish at 2025-09-10 11:48:12 + [2025-09-10 03:33:48] iteration 6640/ 11920 | consumed samples: 6799360 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869343E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:17.423744 | finish at 2025-09-10 11:49:05 + [2025-09-10 03:33:54] iteration 6641/ 11920 | consumed samples: 6800384 | elapsed time per iteration (ms): 5628.7 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867808E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:14.054652 | finish at 2025-09-10 11:49:08 + [2025-09-10 03:33:59] iteration 6642/ 11920 | consumed samples: 6801408 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874288E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:55.014182 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:34:05] iteration 6643/ 11920 | consumed samples: 6802432 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863194E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:40.086604 | finish at 2025-09-10 11:48:45 + [2025-09-10 03:34:11] iteration 6644/ 11920 | consumed samples: 6803456 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872126E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:01.209688 | finish at 2025-09-10 11:48:12 + [2025-09-10 03:34:16] iteration 6645/ 11920 | consumed samples: 6804480 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863242E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:24.825827 
| finish at 2025-09-10 11:48:41 + [2025-09-10 03:34:22] iteration 6646/ 11920 | consumed samples: 6805504 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873785E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:16.206023 | finish at 2025-09-10 11:49:38 + [2025-09-10 03:34:27] iteration 6647/ 11920 | consumed samples: 6806528 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862112E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:49.702115 | finish at 2025-09-10 11:48:17 + [2025-09-10 03:34:33] iteration 6648/ 11920 | consumed samples: 6807552 | elapsed time per iteration (ms): 5963.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873036E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:00.087467 | finish at 2025-09-10 12:18:33 + [2025-09-10 03:34:39] iteration 6649/ 11920 | consumed samples: 6808576 | elapsed time per iteration (ms): 6018.2 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862146E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:42.155475 | finish at 2025-09-10 12:23:22 + [2025-09-10 03:34:45] iteration 6650/ 11920 | consumed samples: 6809600 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.878777E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:01.412973 | finish at 2025-09-10 11:48:46 + [2025-09-10 03:34:51] iteration 6651/ 11920 | consumed samples: 6810624 | elapsed time per iteration (ms): 5950.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870502E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:30.655774 | finish at 2025-09-10 12:17:22 + [2025-09-10 03:34:57] iteration 6652/ 11920 | consumed samples: 6811648 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870973E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:27.452920 | finish at 2025-09-10 11:49:24 + [2025-09-10 03:35:02] iteration 6653/ 11920 | consumed samples: 6812672 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854222E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:39.900561 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:35:08] iteration 6654/ 11920 | consumed samples: 6813696 | elapsed time per iteration (ms): 5927.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875674E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:14.040534 | finish at 2025-09-10 12:15:22 + [2025-09-10 03:35:14] iteration 6655/ 11920 | consumed samples: 6814720 | 
elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867393E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:15.815524 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:35:19] iteration 6656/ 11920 | consumed samples: 6815744 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873587E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:47.900570 | finish at 2025-09-10 11:49:07 + [2025-09-10 03:35:25] iteration 6657/ 11920 | consumed samples: 6816768 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856379E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:49.915138 | finish at 2025-09-10 11:49:15 + [2025-09-10 03:35:31] iteration 6658/ 11920 | consumed samples: 6817792 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872333E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:01.808441 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:35:36] iteration 6659/ 11920 | consumed samples: 6818816 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872761E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:12:33.069541 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:35:42] iteration 6660/ 11920 | consumed samples: 6819840 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861620E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:45.127182 | finish at 2025-09-10 11:48:27 + [2025-09-10 03:35:48] iteration 6661/ 11920 | consumed samples: 6820864 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878347E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:25.615105 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:35:53] iteration 6662/ 11920 | consumed samples: 6821888 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887859E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:31.327081 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:35:59] iteration 6663/ 11920 | consumed samples: 6822912 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870526E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:15.880334 | finish at 2025-09-10 11:49:15 + [2025-09-10 03:36:04] iteration 6664/ 11920 | consumed samples: 6823936 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.866068E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:11:59.851049 | finish at 2025-09-10 11:48:04 + [2025-09-10 03:36:10] iteration 6665/ 11920 | consumed samples: 6824960 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846897E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:48.765039 | finish at 2025-09-10 11:49:59 + [2025-09-10 03:36:16] iteration 6666/ 11920 | consumed samples: 6825984 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858686E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:06.065156 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:36:21] iteration 6667/ 11920 | consumed samples: 6827008 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868018E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:17.517656 | finish at 2025-09-10 11:49:39 + [2025-09-10 03:36:27] iteration 6668/ 11920 | consumed samples: 6828032 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860712E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:07.592864 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:36:33] 
iteration 6669/ 11920 | consumed samples: 6829056 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865501E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:15.232460 | finish at 2025-09-10 11:48:48 + [2025-09-10 03:36:38] iteration 6670/ 11920 | consumed samples: 6830080 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863152E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:24.142485 | finish at 2025-09-10 11:49:02 + [2025-09-10 03:36:44] iteration 6671/ 11920 | consumed samples: 6831104 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859100E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:11:41.008799 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:36:49] iteration 6672/ 11920 | consumed samples: 6832128 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874701E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:01.825531 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:36:55] iteration 6673/ 11920 | consumed samples: 6833152 | elapsed time per iteration (ms): 5953.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860431E+00 | loss scale: 1.0 | grad norm: 0.202 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:40:35.761651 | finish at 2025-09-10 12:17:31 + [2025-09-10 03:37:01] iteration 6674/ 11920 | consumed samples: 6834176 | elapsed time per iteration (ms): 5890.8 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870659E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:03.359428 | finish at 2025-09-10 12:12:05 + [2025-09-10 03:37:07] iteration 6675/ 11920 | consumed samples: 6835200 | elapsed time per iteration (ms): 6240.4 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874924E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:05:31.024699 | finish at 2025-09-10 12:42:39 + [2025-09-10 03:37:13] iteration 6676/ 11920 | consumed samples: 6836224 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859042E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:11:30.210943 | finish at 2025-09-10 11:48:43 + [2025-09-10 03:37:19] iteration 6677/ 11920 | consumed samples: 6837248 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881555E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:09.980872 | finish at 2025-09-10 11:49:29 + [2025-09-10 03:37:24] iteration 6678/ 11920 | consumed samples: 6838272 | elapsed time per iteration (ms): 5625.1 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875420E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:11:26.952382 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:37:30] iteration 6679/ 11920 | consumed samples: 6839296 | elapsed time per iteration (ms): 6078.6 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870028E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:50:58.108423 | finish at 2025-09-10 12:28:29 + [2025-09-10 03:37:36] iteration 6680/ 11920 | consumed samples: 6840320 | elapsed time per iteration (ms): 5893.2 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852316E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:34:40.346346 | finish at 2025-09-10 12:12:17 + [2025-09-10 03:37:42] iteration 6681/ 11920 | consumed samples: 6841344 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851319E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:42.338773 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:37:48] iteration 6682/ 11920 | consumed samples: 6842368 | elapsed time per iteration (ms): 5926.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868680E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:23.121074 
| finish at 2025-09-10 12:15:11 + [2025-09-10 03:37:54] iteration 6683/ 11920 | consumed samples: 6843392 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874709E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:36.811430 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:37:59] iteration 6684/ 11920 | consumed samples: 6844416 | elapsed time per iteration (ms): 5852.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875869E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:43.601167 | finish at 2025-09-10 12:08:43 + [2025-09-10 03:38:05] iteration 6685/ 11920 | consumed samples: 6845440 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864229E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:26.830173 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:38:11] iteration 6686/ 11920 | consumed samples: 6846464 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877899E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:55.678566 | finish at 2025-09-10 11:48:06 + [2025-09-10 03:38:16] iteration 6687/ 11920 | consumed samples: 6847488 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.866645E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:01.569298 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:38:22] iteration 6688/ 11920 | consumed samples: 6848512 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874271E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:41.099419 | finish at 2025-09-10 11:49:03 + [2025-09-10 03:38:28] iteration 6689/ 11920 | consumed samples: 6849536 | elapsed time per iteration (ms): 5920.2 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872531E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:08.550560 | finish at 2025-09-10 12:14:36 + [2025-09-10 03:38:33] iteration 6690/ 11920 | consumed samples: 6850560 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890979E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:42.732189 | finish at 2025-09-10 11:49:16 + [2025-09-10 03:38:39] iteration 6691/ 11920 | consumed samples: 6851584 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874502E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:05.662309 | finish at 2025-09-10 11:48:45 + [2025-09-10 03:38:45] iteration 6692/ 11920 | consumed samples: 6852608 | 
elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883932E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:40.402126 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:38:50] iteration 6693/ 11920 | consumed samples: 6853632 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857449E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:04.609193 | finish at 2025-09-10 11:48:55 + [2025-09-10 03:38:56] iteration 6694/ 11920 | consumed samples: 6854656 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877490E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:27.674798 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:39:02] iteration 6695/ 11920 | consumed samples: 6855680 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878902E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:39.541677 | finish at 2025-09-10 11:48:41 + [2025-09-10 03:39:08] iteration 6696/ 11920 | consumed samples: 6856704 | elapsed time per iteration (ms): 6333.6 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873811E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 5.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 9:11:26.765242 | finish at 2025-09-10 12:50:35 + [2025-09-10 03:39:13] iteration 6697/ 11920 | consumed samples: 6857728 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867273E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:33.552163 | finish at 2025-09-10 11:48:47 + [2025-09-10 03:39:19] iteration 6698/ 11920 | consumed samples: 6858752 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873706E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:53.644204 | finish at 2025-09-10 11:49:13 + [2025-09-10 03:39:25] iteration 6699/ 11920 | consumed samples: 6859776 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880721E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:10.541190 | finish at 2025-09-10 11:48:35 + [2025-09-10 03:39:30] iteration 6700/ 11920 | consumed samples: 6860800 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887046E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:44.857006 | finish at 2025-09-10 11:49:15 + [2025-09-10 03:39:36] iteration 6701/ 11920 | consumed samples: 6861824 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.871525E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:52.796424 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:39:42] iteration 6702/ 11920 | consumed samples: 6862848 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856167E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:47.187234 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:39:47] iteration 6703/ 11920 | consumed samples: 6863872 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880141E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:48.610653 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:39:53] iteration 6704/ 11920 | consumed samples: 6864896 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873316E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:12.681351 | finish at 2025-09-10 11:48:06 + [2025-09-10 03:39:59] iteration 6705/ 11920 | consumed samples: 6865920 | elapsed time per iteration (ms): 6247.9 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871300E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:03:02.891799 | finish at 2025-09-10 12:43:02 + [2025-09-10 03:40:05] 
iteration 6706/ 11920 | consumed samples: 6866944 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878004E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:25.703901 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:40:10] iteration 6707/ 11920 | consumed samples: 6867968 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878386E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:26.554976 | finish at 2025-09-10 11:48:37 + [2025-09-10 03:40:16] iteration 6708/ 11920 | consumed samples: 6868992 | elapsed time per iteration (ms): 5837.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863485E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:27:03.442524 | finish at 2025-09-10 12:07:20 + [2025-09-10 03:40:22] iteration 6709/ 11920 | consumed samples: 6870016 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859576E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:41.512291 | finish at 2025-09-10 11:49:03 + [2025-09-10 03:40:27] iteration 6710/ 11920 | consumed samples: 6871040 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870425E+00 | loss scale: 1.0 | grad norm: 0.119 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:32.273238 | finish at 2025-09-10 11:49:00 + [2025-09-10 03:40:33] iteration 6711/ 11920 | consumed samples: 6872064 | elapsed time per iteration (ms): 5962.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859187E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:37:36.598922 | finish at 2025-09-10 12:18:10 + [2025-09-10 03:40:39] iteration 6712/ 11920 | consumed samples: 6873088 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862258E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:35.268597 | finish at 2025-09-10 11:48:14 + [2025-09-10 03:40:45] iteration 6713/ 11920 | consumed samples: 6874112 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857208E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:26.113106 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:40:51] iteration 6714/ 11920 | consumed samples: 6875136 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862245E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:39:13.206520 | finish at 2025-09-10 12:20:04 + [2025-09-10 03:40:56] iteration 6715/ 11920 | consumed samples: 6876160 | elapsed time per iteration (ms): 5626.2 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859611E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:08:04.300060 | finish at 2025-09-10 11:49:01 + [2025-09-10 03:41:02] iteration 6716/ 11920 | consumed samples: 6877184 | elapsed time per iteration (ms): 5975.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873739E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:14.289093 | finish at 2025-09-10 12:19:16 + [2025-09-10 03:41:08] iteration 6717/ 11920 | consumed samples: 6878208 | elapsed time per iteration (ms): 6170.0 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868402E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:55:02.741128 | finish at 2025-09-10 12:36:11 + [2025-09-10 03:41:14] iteration 6718/ 11920 | consumed samples: 6879232 | elapsed time per iteration (ms): 5917.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881639E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:33:02.897112 | finish at 2025-09-10 12:14:17 + [2025-09-10 03:41:20] iteration 6719/ 11920 | consumed samples: 6880256 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871510E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:58.602479 
| finish at 2025-09-10 11:49:19 + [2025-09-10 03:41:26] iteration 6720/ 11920 | consumed samples: 6881280 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877786E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:06.616573 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:41:31] iteration 6721/ 11920 | consumed samples: 6882304 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872788E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:48.725882 | finish at 2025-09-10 11:48:20 + [2025-09-10 03:41:37] iteration 6722/ 11920 | consumed samples: 6883328 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869978E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:53.488113 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:41:42] iteration 6723/ 11920 | consumed samples: 6884352 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864744E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:37.023708 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:41:48] iteration 6724/ 11920 | consumed samples: 6885376 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.867445E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:40.483749 | finish at 2025-09-10 11:48:28 + [2025-09-10 03:41:54] iteration 6725/ 11920 | consumed samples: 6886400 | elapsed time per iteration (ms): 5880.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856474E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:10.359699 | finish at 2025-09-10 12:11:04 + [2025-09-10 03:42:00] iteration 6726/ 11920 | consumed samples: 6887424 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853246E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:42.817660 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:42:05] iteration 6727/ 11920 | consumed samples: 6888448 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881234E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:24.217401 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:42:11] iteration 6728/ 11920 | consumed samples: 6889472 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853864E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:02.281132 | finish at 2025-09-10 11:48:13 + [2025-09-10 03:42:16] iteration 6729/ 11920 | consumed samples: 6890496 | 
elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867824E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:21.621186 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:42:22] iteration 6730/ 11920 | consumed samples: 6891520 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858047E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:48.612320 | finish at 2025-09-10 11:49:11 + [2025-09-10 03:42:28] iteration 6731/ 11920 | consumed samples: 6892544 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850823E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 12.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:15.741088 | finish at 2025-09-10 11:48:43 + [2025-09-10 03:42:33] iteration 6732/ 11920 | consumed samples: 6893568 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885598E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:03.200406 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:42:39] iteration 6733/ 11920 | consumed samples: 6894592 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873140E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:05:39.339375 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:42:44] iteration 6734/ 11920 | consumed samples: 6895616 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859008E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:24.732701 | finish at 2025-09-10 11:48:09 + [2025-09-10 03:42:50] iteration 6735/ 11920 | consumed samples: 6896640 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867436E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:41.936929 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:42:56] iteration 6736/ 11920 | consumed samples: 6897664 | elapsed time per iteration (ms): 5835.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878030E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:24:09.298965 | finish at 2025-09-10 12:07:05 + [2025-09-10 03:43:02] iteration 6737/ 11920 | consumed samples: 6898688 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877502E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:15.085726 | finish at 2025-09-10 11:49:17 + [2025-09-10 03:43:07] iteration 6738/ 11920 | consumed samples: 6899712 | elapsed time per iteration (ms): 5933.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.866669E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:32:25.125210 | finish at 2025-09-10 12:15:33 + [2025-09-10 03:43:13] iteration 6739/ 11920 | consumed samples: 6900736 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869560E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:20.659569 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:43:19] iteration 6740/ 11920 | consumed samples: 6901760 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864464E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:03.027620 | finish at 2025-09-10 11:49:22 + [2025-09-10 03:43:24] iteration 6741/ 11920 | consumed samples: 6902784 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870646E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:34.205011 | finish at 2025-09-10 11:48:59 + [2025-09-10 03:43:30] iteration 6742/ 11920 | consumed samples: 6903808 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856833E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:06.116027 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:43:36] 
iteration 6743/ 11920 | consumed samples: 6904832 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863244E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:18.413147 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:43:42] iteration 6744/ 11920 | consumed samples: 6905856 | elapsed time per iteration (ms): 5972.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866683E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:35:13.029842 | finish at 2025-09-10 12:18:55 + [2025-09-10 03:43:47] iteration 6745/ 11920 | consumed samples: 6906880 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877698E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:04:49.800507 | finish at 2025-09-10 11:48:37 + [2025-09-10 03:43:53] iteration 6746/ 11920 | consumed samples: 6907904 | elapsed time per iteration (ms): 5992.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862893E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:45.886149 | finish at 2025-09-10 12:20:39 + [2025-09-10 03:43:59] iteration 6747/ 11920 | consumed samples: 6908928 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868553E+00 | loss scale: 1.0 | grad norm: 0.165 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:15.037782 | finish at 2025-09-10 11:49:14 + [2025-09-10 03:44:04] iteration 6748/ 11920 | consumed samples: 6909952 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868001E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:04:06.248852 | finish at 2025-09-10 11:48:11 + [2025-09-10 03:44:10] iteration 6749/ 11920 | consumed samples: 6910976 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871531E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:27.231520 | finish at 2025-09-10 11:49:37 + [2025-09-10 03:44:16] iteration 6750/ 11920 | consumed samples: 6912000 | elapsed time per iteration (ms): 5989.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856876E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:36:03.945189 | finish at 2025-09-10 12:20:20 + [2025-09-10 03:44:22] iteration 6751/ 11920 | consumed samples: 6913024 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841915E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:14.261522 | finish at 2025-09-10 11:49:36 + [2025-09-10 03:44:27] iteration 6752/ 11920 | consumed samples: 6914048 | elapsed time per iteration (ms): 5621.8 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870938E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:04:13.400513 | finish at 2025-09-10 11:48:41 + [2025-09-10 03:44:33] iteration 6753/ 11920 | consumed samples: 6915072 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863231E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:04:27.927825 | finish at 2025-09-10 11:49:01 + [2025-09-10 03:44:39] iteration 6754/ 11920 | consumed samples: 6916096 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869030E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:42.793847 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:44:44] iteration 6755/ 11920 | consumed samples: 6917120 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866094E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:57.750572 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:44:50] iteration 6756/ 11920 | consumed samples: 6918144 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861001E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:27.315074 
| finish at 2025-09-10 11:48:17 + [2025-09-10 03:44:55] iteration 6757/ 11920 | consumed samples: 6919168 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860981E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:29.441794 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:45:01] iteration 6758/ 11920 | consumed samples: 6920192 | elapsed time per iteration (ms): 5929.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864927E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:30:05.282113 | finish at 2025-09-10 12:15:07 + [2025-09-10 03:45:07] iteration 6759/ 11920 | consumed samples: 6921216 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868719E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:29.784389 | finish at 2025-09-10 11:48:37 + [2025-09-10 03:45:13] iteration 6760/ 11920 | consumed samples: 6922240 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870436E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:18.851252 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:45:18] iteration 6761/ 11920 | consumed samples: 6923264 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.862175E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:04:21.204889 | finish at 2025-09-10 11:49:39 + [2025-09-10 03:45:24] iteration 6762/ 11920 | consumed samples: 6924288 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866304E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:59.167972 | finish at 2025-09-10 11:49:23 + [2025-09-10 03:45:29] iteration 6763/ 11920 | consumed samples: 6925312 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868199E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:56.028260 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:45:35] iteration 6764/ 11920 | consumed samples: 6926336 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874109E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:48.724133 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:45:41] iteration 6765/ 11920 | consumed samples: 6927360 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847512E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:01.687657 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:45:46] iteration 6766/ 11920 | consumed samples: 6928384 | 
elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856130E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:37.149312 | finish at 2025-09-10 11:48:23 + [2025-09-10 03:45:52] iteration 6767/ 11920 | consumed samples: 6929408 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869142E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:48.353750 | finish at 2025-09-10 11:48:40 + [2025-09-10 03:45:58] iteration 6768/ 11920 | consumed samples: 6930432 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855610E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:51.993729 | finish at 2025-09-10 11:48:50 + [2025-09-10 03:46:03] iteration 6769/ 11920 | consumed samples: 6931456 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867782E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:39.888402 | finish at 2025-09-10 11:48:43 + [2025-09-10 03:46:09] iteration 6770/ 11920 | consumed samples: 6932480 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853178E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:02:12.604384 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:46:14] iteration 6771/ 11920 | consumed samples: 6933504 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855748E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:05.520628 | finish at 2025-09-10 11:48:20 + [2025-09-10 03:46:20] iteration 6772/ 11920 | consumed samples: 6934528 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862040E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:42.378831 | finish at 2025-09-10 11:49:02 + [2025-09-10 03:46:26] iteration 6773/ 11920 | consumed samples: 6935552 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887888E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:53.177127 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:46:31] iteration 6774/ 11920 | consumed samples: 6936576 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852399E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:02.616187 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:46:37] iteration 6775/ 11920 | consumed samples: 6937600 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.876438E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:37.638824 | finish at 2025-09-10 11:49:15 + [2025-09-10 03:46:43] iteration 6776/ 11920 | consumed samples: 6938624 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859119E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:36.126001 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:46:48] iteration 6777/ 11920 | consumed samples: 6939648 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858558E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:47.739931 | finish at 2025-09-10 11:49:36 + [2025-09-10 03:46:54] iteration 6778/ 11920 | consumed samples: 6940672 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864329E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:36.796300 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:46:59] iteration 6779/ 11920 | consumed samples: 6941696 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856588E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:58.310082 | finish at 2025-09-10 11:48:58 + [2025-09-10 03:47:05] 
iteration 6780/ 11920 | consumed samples: 6942720 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873187E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:31.966000 | finish at 2025-09-10 11:48:37 + [2025-09-10 03:47:11] iteration 6781/ 11920 | consumed samples: 6943744 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872528E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:22.899639 | finish at 2025-09-10 11:48:34 + [2025-09-10 03:47:16] iteration 6782/ 11920 | consumed samples: 6944768 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861519E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:50.165273 | finish at 2025-09-10 11:48:06 + [2025-09-10 03:47:22] iteration 6783/ 11920 | consumed samples: 6945792 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855295E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:56.293986 | finish at 2025-09-10 11:49:18 + [2025-09-10 03:47:28] iteration 6784/ 11920 | consumed samples: 6946816 | elapsed time per iteration (ms): 5951.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858267E+00 | loss scale: 1.0 | grad norm: 0.175 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:28.447803 | finish at 2025-09-10 12:16:56 + [2025-09-10 03:47:33] iteration 6785/ 11920 | consumed samples: 6947840 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862509E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:04.504945 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:47:39] iteration 6786/ 11920 | consumed samples: 6948864 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854388E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:51.097690 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:47:45] iteration 6787/ 11920 | consumed samples: 6949888 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868566E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:35.783112 | finish at 2025-09-10 11:48:21 + [2025-09-10 03:47:50] iteration 6788/ 11920 | consumed samples: 6950912 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861996E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:40.283039 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:47:56] iteration 6789/ 11920 | consumed samples: 6951936 | elapsed time per iteration (ms): 5621.4 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871089E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:43.465171 | finish at 2025-09-10 11:48:39 + [2025-09-10 03:48:02] iteration 6790/ 11920 | consumed samples: 6952960 | elapsed time per iteration (ms): 5859.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862737E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:20:59.816837 | finish at 2025-09-10 12:09:02 + [2025-09-10 03:48:08] iteration 6791/ 11920 | consumed samples: 6953984 | elapsed time per iteration (ms): 6400.7 | throughput per GPU (TFLOP/s/GPU): 70.5 | MFU 7.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872740E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 9:07:09.308167 | finish at 2025-09-10 12:55:18 + [2025-09-10 03:48:14] iteration 6792/ 11920 | consumed samples: 6955008 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872659E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:33.444717 | finish at 2025-09-10 11:49:47 + [2025-09-10 03:48:19] iteration 6793/ 11920 | consumed samples: 6956032 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874967E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:31.824408 
| finish at 2025-09-10 11:48:51 + [2025-09-10 03:48:25] iteration 6794/ 11920 | consumed samples: 6957056 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854629E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:12.396881 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:48:31] iteration 6795/ 11920 | consumed samples: 6958080 | elapsed time per iteration (ms): 5832.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855608E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:18:10.676528 | finish at 2025-09-10 12:06:42 + [2025-09-10 03:48:37] iteration 6796/ 11920 | consumed samples: 6959104 | elapsed time per iteration (ms): 5882.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865530E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:22:23.310130 | finish at 2025-09-10 12:11:00 + [2025-09-10 03:48:43] iteration 6797/ 11920 | consumed samples: 6960128 | elapsed time per iteration (ms): 5867.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858667E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:21:00.887178 | finish at 2025-09-10 12:09:44 + [2025-09-10 03:48:48] iteration 6798/ 11920 | consumed samples: 6961152 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.860598E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:47.780142 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:48:54] iteration 6799/ 11920 | consumed samples: 6962176 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875529E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:14.268536 | finish at 2025-09-10 11:48:08 + [2025-09-10 03:49:00] iteration 6800/ 11920 | consumed samples: 6963200 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860798E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:26.251221 | finish at 2025-09-10 11:48:26 + [2025-09-10 03:49:06] iteration 6801/ 11920 | consumed samples: 6964224 | elapsed time per iteration (ms): 6192.5 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852464E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:48:19.261630 | finish at 2025-09-10 12:37:25 + [2025-09-10 03:49:11] iteration 6802/ 11920 | consumed samples: 6965248 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860653E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:30.409746 | finish at 2025-09-10 11:49:42 + [2025-09-10 03:49:17] iteration 6803/ 11920 | consumed samples: 6966272 | 
elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863158E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:07.872231 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:49:23] iteration 6804/ 11920 | consumed samples: 6967296 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859444E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:31.370759 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:49:29] iteration 6805/ 11920 | consumed samples: 6968320 | elapsed time per iteration (ms): 5905.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861818E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:27.754118 | finish at 2025-09-10 12:12:56 + [2025-09-10 03:49:34] iteration 6806/ 11920 | consumed samples: 6969344 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876330E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:56.853337 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:49:40] iteration 6807/ 11920 | consumed samples: 6970368 | elapsed time per iteration (ms): 5977.2 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867517E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:29:21.600520 | finish at 2025-09-10 12:19:02 + [2025-09-10 03:49:46] iteration 6808/ 11920 | consumed samples: 6971392 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846991E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:32.689602 | finish at 2025-09-10 11:49:18 + [2025-09-10 03:49:51] iteration 6809/ 11920 | consumed samples: 6972416 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864388E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:23.311095 | finish at 2025-09-10 11:48:15 + [2025-09-10 03:49:57] iteration 6810/ 11920 | consumed samples: 6973440 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858326E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:50.894299 | finish at 2025-09-10 11:48:48 + [2025-09-10 03:50:03] iteration 6811/ 11920 | consumed samples: 6974464 | elapsed time per iteration (ms): 6255.9 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863362E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:52:41.314663 | finish at 2025-09-10 12:42:45 + [2025-09-10 03:50:09] iteration 6812/ 11920 | consumed samples: 6975488 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.859520E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:46.083190 | finish at 2025-09-10 11:48:55 + [2025-09-10 03:50:15] iteration 6813/ 11920 | consumed samples: 6976512 | elapsed time per iteration (ms): 6137.3 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868400E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:42:23.344592 | finish at 2025-09-10 12:32:38 + [2025-09-10 03:50:21] iteration 6814/ 11920 | consumed samples: 6977536 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867095E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:47.750731 | finish at 2025-09-10 11:49:08 + [2025-09-10 03:50:26] iteration 6815/ 11920 | consumed samples: 6978560 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862434E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:05.151795 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:50:32] iteration 6816/ 11920 | consumed samples: 6979584 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877144E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:51.013329 | finish at 2025-09-10 11:48:23 + [2025-09-10 03:50:37] 
iteration 6817/ 11920 | consumed samples: 6980608 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875549E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:15.390352 | finish at 2025-09-10 11:49:53 + [2025-09-10 03:50:43] iteration 6818/ 11920 | consumed samples: 6981632 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856613E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:36.323498 | finish at 2025-09-10 11:49:19 + [2025-09-10 03:50:49] iteration 6819/ 11920 | consumed samples: 6982656 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859514E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:56.578964 | finish at 2025-09-10 11:48:45 + [2025-09-10 03:50:54] iteration 6820/ 11920 | consumed samples: 6983680 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890163E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:11.606212 | finish at 2025-09-10 11:49:06 + [2025-09-10 03:51:00] iteration 6821/ 11920 | consumed samples: 6984704 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869918E+00 | loss scale: 1.0 | grad norm: 0.197 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:56.611035 | finish at 2025-09-10 11:48:57 + [2025-09-10 03:51:06] iteration 6822/ 11920 | consumed samples: 6985728 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867465E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:08.612422 | finish at 2025-09-10 11:49:14 + [2025-09-10 03:51:11] iteration 6823/ 11920 | consumed samples: 6986752 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860019E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:19.323376 | finish at 2025-09-10 11:48:31 + [2025-09-10 03:51:17] iteration 6824/ 11920 | consumed samples: 6987776 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871222E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:50.637512 | finish at 2025-09-10 11:49:08 + [2025-09-10 03:51:23] iteration 6825/ 11920 | consumed samples: 6988800 | elapsed time per iteration (ms): 5839.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865225E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:54.154534 | finish at 2025-09-10 12:07:17 + [2025-09-10 03:51:28] iteration 6826/ 11920 | consumed samples: 6989824 | elapsed time per iteration (ms): 5623.4 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864368E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:25.502302 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:51:34] iteration 6827/ 11920 | consumed samples: 6990848 | elapsed time per iteration (ms): 5952.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863333E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:17.001019 | finish at 2025-09-10 12:16:51 + [2025-09-10 03:51:40] iteration 6828/ 11920 | consumed samples: 6991872 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876969E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:56.204166 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:51:46] iteration 6829/ 11920 | consumed samples: 6992896 | elapsed time per iteration (ms): 5888.9 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863593E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:40.348843 | finish at 2025-09-10 12:11:26 + [2025-09-10 03:51:51] iteration 6830/ 11920 | consumed samples: 6993920 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876159E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:00.405712 
| finish at 2025-09-10 11:48:52 + [2025-09-10 03:51:57] iteration 6831/ 11920 | consumed samples: 6994944 | elapsed time per iteration (ms): 5829.4 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857882E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:25.870304 | finish at 2025-09-10 12:06:23 + [2025-09-10 03:52:03] iteration 6832/ 11920 | consumed samples: 6995968 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855272E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:14.746284 | finish at 2025-09-10 11:48:18 + [2025-09-10 03:52:08] iteration 6833/ 11920 | consumed samples: 6996992 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878327E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:16.990564 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:52:14] iteration 6834/ 11920 | consumed samples: 6998016 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857838E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:32.012525 | finish at 2025-09-10 11:48:46 + [2025-09-10 03:52:20] iteration 6835/ 11920 | consumed samples: 6999040 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.847625E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:23.144953 | finish at 2025-09-10 11:49:43 + [2025-09-10 03:52:25] iteration 6836/ 11920 | consumed samples: 7000064 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856208E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:43.376358 | finish at 2025-09-10 11:49:09 + [2025-09-10 03:52:31] iteration 6837/ 11920 | consumed samples: 7001088 | elapsed time per iteration (ms): 5890.9 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857875E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:03.644135 | finish at 2025-09-10 12:11:35 + [2025-09-10 03:52:37] iteration 6838/ 11920 | consumed samples: 7002112 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846747E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:14.842379 | finish at 2025-09-10 11:48:52 + [2025-09-10 03:52:43] iteration 6839/ 11920 | consumed samples: 7003136 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849418E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:13.023435 | finish at 2025-09-10 11:48:56 + [2025-09-10 03:52:48] iteration 6840/ 11920 | consumed samples: 7004160 | 
elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869790E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:46.664762 | finish at 2025-09-10 11:48:35 + [2025-09-10 03:52:54] iteration 6841/ 11920 | consumed samples: 7005184 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867945E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:56:00.193744 | finish at 2025-09-10 11:48:54 + [2025-09-10 03:52:59] iteration 6842/ 11920 | consumed samples: 7006208 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853021E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:30.900361 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:53:05] iteration 6843/ 11920 | consumed samples: 7007232 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857077E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:50.594784 | finish at 2025-09-10 11:48:56 + [2025-09-10 03:53:11] iteration 6844/ 11920 | consumed samples: 7008256 | elapsed time per iteration (ms): 5944.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868287E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:22:55.580403 | finish at 2025-09-10 12:16:07 + [2025-09-10 03:53:17] iteration 6845/ 11920 | consumed samples: 7009280 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855727E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:44.560319 | finish at 2025-09-10 11:49:01 + [2025-09-10 03:53:22] iteration 6846/ 11920 | consumed samples: 7010304 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870769E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:43.560596 | finish at 2025-09-10 11:49:06 + [2025-09-10 03:53:28] iteration 6847/ 11920 | consumed samples: 7011328 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857257E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:23.819095 | finish at 2025-09-10 11:48:52 + [2025-09-10 03:53:33] iteration 6848/ 11920 | consumed samples: 7012352 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853699E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:53.985653 | finish at 2025-09-10 11:49:27 + [2025-09-10 03:53:39] iteration 6849/ 11920 | consumed samples: 7013376 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.858662E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:53.584938 | finish at 2025-09-10 11:49:33 + [2025-09-10 03:53:45] iteration 6850/ 11920 | consumed samples: 7014400 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864828E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:01.748478 | finish at 2025-09-10 11:48:46 + [2025-09-10 03:53:50] iteration 6851/ 11920 | consumed samples: 7015424 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869282E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:54:51.488441 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:53:56] iteration 6852/ 11920 | consumed samples: 7016448 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874688E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:54:34.557971 | finish at 2025-09-10 11:48:30 + [2025-09-10 03:54:02] iteration 6853/ 11920 | consumed samples: 7017472 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857143E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:07.819897 | finish at 2025-09-10 11:49:09 + [2025-09-10 03:54:07] 
iteration 6854/ 11920 | consumed samples: 7018496 | elapsed time per iteration (ms): 5918.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868082E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:42.083205 | finish at 2025-09-10 12:13:50 + [2025-09-10 03:54:13] iteration 6855/ 11920 | consumed samples: 7019520 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870082E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:04.984454 | finish at 2025-09-10 11:49:18 + [2025-09-10 03:54:19] iteration 6856/ 11920 | consumed samples: 7020544 | elapsed time per iteration (ms): 5967.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865097E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:37.660744 | finish at 2025-09-10 12:17:57 + [2025-09-10 03:54:25] iteration 6857/ 11920 | consumed samples: 7021568 | elapsed time per iteration (ms): 5843.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880210E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:05.107651 | finish at 2025-09-10 12:07:30 + [2025-09-10 03:54:31] iteration 6858/ 11920 | consumed samples: 7022592 | elapsed time per iteration (ms): 5995.4 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876553E+00 | loss scale: 1.0 | grad norm: 0.235 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:25:48.929380 | finish at 2025-09-10 12:20:20 + [2025-09-10 03:54:37] iteration 6859/ 11920 | consumed samples: 7023616 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868792E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:23.006445 | finish at 2025-09-10 11:50:00 + [2025-09-10 03:54:42] iteration 6860/ 11920 | consumed samples: 7024640 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865404E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:53.422189 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:54:48] iteration 6861/ 11920 | consumed samples: 7025664 | elapsed time per iteration (ms): 5965.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875751E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:00.066962 | finish at 2025-09-10 12:17:48 + [2025-09-10 03:54:54] iteration 6862/ 11920 | consumed samples: 7026688 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872011E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:28.469946 | finish at 2025-09-10 11:48:22 + [2025-09-10 03:54:59] iteration 6863/ 11920 | consumed samples: 7027712 | elapsed time per iteration (ms): 5622.4 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871724E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:52.393837 | finish at 2025-09-10 11:48:52 + [2025-09-10 03:55:05] iteration 6864/ 11920 | consumed samples: 7028736 | elapsed time per iteration (ms): 5969.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851305E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:23:01.264343 | finish at 2025-09-10 12:18:07 + [2025-09-10 03:55:11] iteration 6865/ 11920 | consumed samples: 7029760 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865409E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:12.899044 | finish at 2025-09-10 11:48:24 + [2025-09-10 03:55:17] iteration 6866/ 11920 | consumed samples: 7030784 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862753E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:45.901457 | finish at 2025-09-10 11:49:02 + [2025-09-10 03:55:22] iteration 6867/ 11920 | consumed samples: 7031808 | elapsed time per iteration (ms): 5851.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874508E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:48.548882 
| finish at 2025-09-10 12:08:11 + [2025-09-10 03:55:28] iteration 6868/ 11920 | consumed samples: 7032832 | elapsed time per iteration (ms): 5613.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855267E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:37.057454 | finish at 2025-09-10 11:48:05 + [2025-09-10 03:55:34] iteration 6869/ 11920 | consumed samples: 7033856 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863689E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:23.483770 | finish at 2025-09-10 11:48:57 + [2025-09-10 03:55:39] iteration 6870/ 11920 | consumed samples: 7034880 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881567E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:11.524911 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:55:45] iteration 6871/ 11920 | consumed samples: 7035904 | elapsed time per iteration (ms): 5615.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857093E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:33.834250 | finish at 2025-09-10 11:48:19 + [2025-09-10 03:55:51] iteration 6872/ 11920 | consumed samples: 7036928 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.876276E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:51.098959 | finish at 2025-09-10 11:48:42 + [2025-09-10 03:55:56] iteration 6873/ 11920 | consumed samples: 7037952 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862784E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:41.625732 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:56:02] iteration 6874/ 11920 | consumed samples: 7038976 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885166E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:35.864269 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:56:07] iteration 6875/ 11920 | consumed samples: 7040000 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861561E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:30.380714 | finish at 2025-09-10 11:48:38 + [2025-09-10 03:56:13] iteration 6876/ 11920 | consumed samples: 7041024 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857204E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:50.445988 | finish at 2025-09-10 11:49:03 + [2025-09-10 03:56:19] iteration 6877/ 11920 | consumed samples: 7042048 | 
elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865889E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:19.918427 | finish at 2025-09-10 11:48:39 + [2025-09-10 03:56:24] iteration 6878/ 11920 | consumed samples: 7043072 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859053E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:45.041444 | finish at 2025-09-10 11:49:09 + [2025-09-10 03:56:30] iteration 6879/ 11920 | consumed samples: 7044096 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867126E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:21.833564 | finish at 2025-09-10 11:48:52 + [2025-09-10 03:56:36] iteration 6880/ 11920 | consumed samples: 7045120 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865050E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:15.038280 | finish at 2025-09-10 11:49:51 + [2025-09-10 03:56:41] iteration 6881/ 11920 | consumed samples: 7046144 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852322E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:51:51.071234 | finish at 2025-09-10 11:48:32 + [2025-09-10 03:56:47] iteration 6882/ 11920 | consumed samples: 7047168 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857319E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:53.750407 | finish at 2025-09-10 11:48:40 + [2025-09-10 03:56:52] iteration 6883/ 11920 | consumed samples: 7048192 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872903E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:55.285501 | finish at 2025-09-10 11:49:48 + [2025-09-10 03:56:58] iteration 6884/ 11920 | consumed samples: 7049216 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856075E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:06.493834 | finish at 2025-09-10 11:49:05 + [2025-09-10 03:57:04] iteration 6885/ 11920 | consumed samples: 7050240 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870130E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:43.345046 | finish at 2025-09-10 11:48:47 + [2025-09-10 03:57:09] iteration 6886/ 11920 | consumed samples: 7051264 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.866128E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:27.679615 | finish at 2025-09-10 11:49:37 + [2025-09-10 03:57:15] iteration 6887/ 11920 | consumed samples: 7052288 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864039E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:09.882734 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:57:20] iteration 6888/ 11920 | consumed samples: 7053312 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865036E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:18.788467 | finish at 2025-09-10 11:48:39 + [2025-09-10 03:57:26] iteration 6889/ 11920 | consumed samples: 7054336 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857950E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:09.766323 | finish at 2025-09-10 11:49:36 + [2025-09-10 03:57:32] iteration 6890/ 11920 | consumed samples: 7055360 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860425E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:21.103406 | finish at 2025-09-10 11:49:53 + [2025-09-10 03:57:38] 
iteration 6891/ 11920 | consumed samples: 7056384 | elapsed time per iteration (ms): 5845.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858291E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:59.223237 | finish at 2025-09-10 12:07:37 + [2025-09-10 03:57:43] iteration 6892/ 11920 | consumed samples: 7057408 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869237E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:08.228660 | finish at 2025-09-10 11:48:51 + [2025-09-10 03:57:49] iteration 6893/ 11920 | consumed samples: 7058432 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862717E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:34.367548 | finish at 2025-09-10 11:49:23 + [2025-09-10 03:57:54] iteration 6894/ 11920 | consumed samples: 7059456 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869949E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:30.734557 | finish at 2025-09-10 11:48:25 + [2025-09-10 03:58:00] iteration 6895/ 11920 | consumed samples: 7060480 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861483E+00 | loss scale: 1.0 | grad norm: 0.178 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:45.158654 | finish at 2025-09-10 11:48:45 + [2025-09-10 03:58:06] iteration 6896/ 11920 | consumed samples: 7061504 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865358E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:45.887344 | finish at 2025-09-10 11:48:52 + [2025-09-10 03:58:11] iteration 6897/ 11920 | consumed samples: 7062528 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877128E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:12.263201 | finish at 2025-09-10 11:49:24 + [2025-09-10 03:58:17] iteration 6898/ 11920 | consumed samples: 7063552 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863478E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:00.451585 | finish at 2025-09-10 11:49:17 + [2025-09-10 03:58:23] iteration 6899/ 11920 | consumed samples: 7064576 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859732E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:20.522561 | finish at 2025-09-10 11:48:43 + [2025-09-10 03:58:28] iteration 6900/ 11920 | consumed samples: 7065600 | elapsed time per iteration (ms): 5621.1 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857474E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:17.718277 | finish at 2025-09-10 11:48:46 + [2025-09-10 03:58:34] iteration 6901/ 11920 | consumed samples: 7066624 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851752E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:33.933191 | finish at 2025-09-10 11:49:08 + [2025-09-10 03:58:40] iteration 6902/ 11920 | consumed samples: 7067648 | elapsed time per iteration (ms): 5947.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856944E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:24.308497 | finish at 2025-09-10 12:16:04 + [2025-09-10 03:58:45] iteration 6903/ 11920 | consumed samples: 7068672 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851824E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:58.465079 | finish at 2025-09-10 11:49:44 + [2025-09-10 03:58:51] iteration 6904/ 11920 | consumed samples: 7069696 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862383E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:51.114138 
| finish at 2025-09-10 11:48:42 + [2025-09-10 03:58:57] iteration 6905/ 11920 | consumed samples: 7070720 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845744E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:50.985608 | finish at 2025-09-10 11:48:48 + [2025-09-10 03:59:02] iteration 6906/ 11920 | consumed samples: 7071744 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867854E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:58.719627 | finish at 2025-09-10 11:49:01 + [2025-09-10 03:59:08] iteration 6907/ 11920 | consumed samples: 7072768 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845476E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:27.663133 | finish at 2025-09-10 11:48:36 + [2025-09-10 03:59:14] iteration 6908/ 11920 | consumed samples: 7073792 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867872E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:26.446420 | finish at 2025-09-10 11:48:40 + [2025-09-10 03:59:19] iteration 6909/ 11920 | consumed samples: 7074816 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.852726E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:44.918056 | finish at 2025-09-10 11:49:04 + [2025-09-10 03:59:25] iteration 6910/ 11920 | consumed samples: 7075840 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853083E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:37.587733 | finish at 2025-09-10 11:49:02 + [2025-09-10 03:59:30] iteration 6911/ 11920 | consumed samples: 7076864 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850462E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:30.391846 | finish at 2025-09-10 11:49:01 + [2025-09-10 03:59:36] iteration 6912/ 11920 | consumed samples: 7077888 | elapsed time per iteration (ms): 5989.6 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874157E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:19:55.692043 | finish at 2025-09-10 12:19:32 + [2025-09-10 03:59:42] iteration 6913/ 11920 | consumed samples: 7078912 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860605E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:50.459890 | finish at 2025-09-10 11:49:32 + [2025-09-10 03:59:48] iteration 6914/ 11920 | consumed samples: 7079936 | 
elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844904E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:07.273079 | finish at 2025-09-10 11:48:55 + [2025-09-10 03:59:53] iteration 6915/ 11920 | consumed samples: 7080960 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870110E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:35.290706 | finish at 2025-09-10 11:48:29 + [2025-09-10 03:59:59] iteration 6916/ 11920 | consumed samples: 7081984 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869210E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:42.626172 | finish at 2025-09-10 11:48:42 + [2025-09-10 04:00:05] iteration 6917/ 11920 | consumed samples: 7083008 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861293E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:31.181103 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:00:10] iteration 6918/ 11920 | consumed samples: 7084032 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863386E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:48:49.681326 | finish at 2025-09-10 11:49:00 + [2025-09-10 04:00:16] iteration 6919/ 11920 | consumed samples: 7085056 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856680E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:26.915988 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:00:21] iteration 6920/ 11920 | consumed samples: 7086080 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870051E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:09.806261 | finish at 2025-09-10 11:49:31 + [2025-09-10 04:00:27] iteration 6921/ 11920 | consumed samples: 7087104 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867440E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:42.746756 | finish at 2025-09-10 11:49:10 + [2025-09-10 04:00:33] iteration 6922/ 11920 | consumed samples: 7088128 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866202E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:50.554170 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:00:39] iteration 6923/ 11920 | consumed samples: 7089152 | elapsed time per iteration (ms): 5968.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.845726E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:17:06.192824 | finish at 2025-09-10 12:17:45 + [2025-09-10 04:00:44] iteration 6924/ 11920 | consumed samples: 7090176 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872330E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:29.758802 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:00:50] iteration 6925/ 11920 | consumed samples: 7091200 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854802E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:03.064123 | finish at 2025-09-10 11:48:53 + [2025-09-10 04:00:55] iteration 6926/ 11920 | consumed samples: 7092224 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870861E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:10.916615 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:01:01] iteration 6927/ 11920 | consumed samples: 7093248 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856660E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:05.291681 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:01:07] 
iteration 6928/ 11920 | consumed samples: 7094272 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858643E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:29.124207 | finish at 2025-09-10 11:48:36 + [2025-09-10 04:01:13] iteration 6929/ 11920 | consumed samples: 7095296 | elapsed time per iteration (ms): 5882.6 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862786E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:20.045045 | finish at 2025-09-10 12:10:33 + [2025-09-10 04:01:18] iteration 6930/ 11920 | consumed samples: 7096320 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873980E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:00.055802 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:01:24] iteration 6931/ 11920 | consumed samples: 7097344 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846802E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:15.012320 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:01:30] iteration 6932/ 11920 | consumed samples: 7098368 | elapsed time per iteration (ms): 5853.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865355E+00 | loss scale: 1.0 | grad norm: 0.170 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:36.238276 | finish at 2025-09-10 12:08:06 + [2025-09-10 04:01:35] iteration 6933/ 11920 | consumed samples: 7099392 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867718E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:41.067108 | finish at 2025-09-10 11:49:16 + [2025-09-10 04:01:41] iteration 6934/ 11920 | consumed samples: 7100416 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864986E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:38.153004 | finish at 2025-09-10 11:49:19 + [2025-09-10 04:01:47] iteration 6935/ 11920 | consumed samples: 7101440 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866144E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:22.022696 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:01:52] iteration 6936/ 11920 | consumed samples: 7102464 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858660E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:03.447559 | finish at 2025-09-10 11:48:56 + [2025-09-10 04:01:58] iteration 6937/ 11920 | consumed samples: 7103488 | elapsed time per iteration (ms): 5617.9 | throughput 
per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845992E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:46:33.978543 | finish at 2025-09-10 11:48:32 + [2025-09-10 04:02:03] iteration 6938/ 11920 | consumed samples: 7104512 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865395E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:46:40.530859 | finish at 2025-09-10 11:48:44 + [2025-09-10 04:02:09] iteration 6939/ 11920 | consumed samples: 7105536 | elapsed time per iteration (ms): 5950.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858583E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:59.195919 | finish at 2025-09-10 12:16:09 + [2025-09-10 04:02:15] iteration 6940/ 11920 | consumed samples: 7106560 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859388E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:46:56.837296 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:02:21] iteration 6941/ 11920 | consumed samples: 7107584 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868608E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:46:28.260303 
| finish at 2025-09-10 11:48:49 + [2025-09-10 04:02:26] iteration 6942/ 11920 | consumed samples: 7108608 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868750E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:01.753981 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:02:32] iteration 6943/ 11920 | consumed samples: 7109632 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867251E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:00.797729 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:02:38] iteration 6944/ 11920 | consumed samples: 7110656 | elapsed time per iteration (ms): 5824.1 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872633E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:00.836765 | finish at 2025-09-10 12:05:39 + [2025-09-10 04:02:44] iteration 6945/ 11920 | consumed samples: 7111680 | elapsed time per iteration (ms): 5939.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862214E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:28.818439 | finish at 2025-09-10 12:15:13 + [2025-09-10 04:02:49] iteration 6946/ 11920 | consumed samples: 7112704 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.862665E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:46:05.687377 | finish at 2025-09-10 11:48:55 + [2025-09-10 04:02:55] iteration 6947/ 11920 | consumed samples: 7113728 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870944E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:32.963288 | finish at 2025-09-10 11:48:28 + [2025-09-10 04:03:01] iteration 6948/ 11920 | consumed samples: 7114752 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853044E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:39.620173 | finish at 2025-09-10 11:48:40 + [2025-09-10 04:03:06] iteration 6949/ 11920 | consumed samples: 7115776 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861047E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:14.063704 | finish at 2025-09-10 11:48:20 + [2025-09-10 04:03:12] iteration 6950/ 11920 | consumed samples: 7116800 | elapsed time per iteration (ms): 5947.2 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856038E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:37.693312 | finish at 2025-09-10 12:15:50 + [2025-09-10 04:03:18] iteration 6951/ 11920 | consumed samples: 7117824 | 
elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860989E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:30.616569 | finish at 2025-09-10 11:48:48 + [2025-09-10 04:03:24] iteration 6952/ 11920 | consumed samples: 7118848 | elapsed time per iteration (ms): 5965.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861674E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:13:57.087839 | finish at 2025-09-10 12:17:21 + [2025-09-10 04:03:30] iteration 6953/ 11920 | consumed samples: 7119872 | elapsed time per iteration (ms): 6331.9 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841344E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:44:10.669870 | finish at 2025-09-10 12:47:41 + [2025-09-10 04:03:36] iteration 6954/ 11920 | consumed samples: 7120896 | elapsed time per iteration (ms): 5924.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861398E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:22.228434 | finish at 2025-09-10 12:13:58 + [2025-09-10 04:03:42] iteration 6955/ 11920 | consumed samples: 7121920 | elapsed time per iteration (ms): 5890.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865061E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 8:07:24.382024 | finish at 2025-09-10 12:11:06 + [2025-09-10 04:03:48] iteration 6956/ 11920 | consumed samples: 7122944 | elapsed time per iteration (ms): 5888.6 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864214E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:10.913903 | finish at 2025-09-10 12:10:59 + [2025-09-10 04:03:53] iteration 6957/ 11920 | consumed samples: 7123968 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850946E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:56.017475 | finish at 2025-09-10 11:48:49 + [2025-09-10 04:03:59] iteration 6958/ 11920 | consumed samples: 7124992 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862649E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:39.728086 | finish at 2025-09-10 11:48:39 + [2025-09-10 04:04:05] iteration 6959/ 11920 | consumed samples: 7126016 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873185E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:35.050943 | finish at 2025-09-10 11:48:40 + [2025-09-10 04:04:10] iteration 6960/ 11920 | consumed samples: 7127040 | elapsed time per iteration (ms): 5844.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.864970E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:06.346931 | finish at 2025-09-10 12:07:17 + [2025-09-10 04:04:16] iteration 6961/ 11920 | consumed samples: 7128064 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856677E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:50.191960 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:04:22] iteration 6962/ 11920 | consumed samples: 7129088 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856728E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:49.693299 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:04:27] iteration 6963/ 11920 | consumed samples: 7130112 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878118E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:43.426369 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:04:33] iteration 6964/ 11920 | consumed samples: 7131136 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852473E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:17.436982 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:04:39] 
iteration 6965/ 11920 | consumed samples: 7132160 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859992E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:34.882753 | finish at 2025-09-10 11:49:13 + [2025-09-10 04:04:44] iteration 6966/ 11920 | consumed samples: 7133184 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867936E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:17.052575 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:04:50] iteration 6967/ 11920 | consumed samples: 7134208 | elapsed time per iteration (ms): 5904.4 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862463E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:24.457139 | finish at 2025-09-10 12:12:15 + [2025-09-10 04:04:56] iteration 6968/ 11920 | consumed samples: 7135232 | elapsed time per iteration (ms): 5991.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855908E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:14:28.288839 | finish at 2025-09-10 12:19:24 + [2025-09-10 04:05:02] iteration 6969/ 11920 | consumed samples: 7136256 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855882E+00 | loss scale: 1.0 | grad norm: 0.210 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:43:27.100963 | finish at 2025-09-10 11:48:29 + [2025-09-10 04:05:08] iteration 6970/ 11920 | consumed samples: 7137280 | elapsed time per iteration (ms): 5847.4 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850623E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:02:24.758391 | finish at 2025-09-10 12:07:32 + [2025-09-10 04:05:13] iteration 6971/ 11920 | consumed samples: 7138304 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857135E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:43:40.582928 | finish at 2025-09-10 11:48:54 + [2025-09-10 04:05:19] iteration 6972/ 11920 | consumed samples: 7139328 | elapsed time per iteration (ms): 5837.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867691E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:25.619740 | finish at 2025-09-10 12:06:45 + [2025-09-10 04:05:25] iteration 6973/ 11920 | consumed samples: 7140352 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852952E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:43:46.956383 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:05:30] iteration 6974/ 11920 | consumed samples: 7141376 | elapsed time per iteration (ms): 5626.7 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869212E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:43:49.445567 | finish at 2025-09-10 11:49:20 + [2025-09-10 04:05:36] iteration 6975/ 11920 | consumed samples: 7142400 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862696E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:43:39.013388 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:05:42] iteration 6976/ 11920 | consumed samples: 7143424 | elapsed time per iteration (ms): 5963.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869358E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:11:21.411770 | finish at 2025-09-10 12:17:03 + [2025-09-10 04:05:47] iteration 6977/ 11920 | consumed samples: 7144448 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864398E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:43:11.930008 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:05:54] iteration 6978/ 11920 | consumed samples: 7145472 | elapsed time per iteration (ms): 6121.0 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.46% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870815E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:24:09.768989 
| finish at 2025-09-10 12:30:03 + [2025-09-10 04:05:59] iteration 6979/ 11920 | consumed samples: 7146496 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844618E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:40.175607 | finish at 2025-09-10 11:48:39 + [2025-09-10 04:06:05] iteration 6980/ 11920 | consumed samples: 7147520 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863900E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:30.042815 | finish at 2025-09-10 11:48:35 + [2025-09-10 04:06:10] iteration 6981/ 11920 | consumed samples: 7148544 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864355E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:39.596944 | finish at 2025-09-10 11:48:50 + [2025-09-10 04:06:16] iteration 6982/ 11920 | consumed samples: 7149568 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860185E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:14.855747 | finish at 2025-09-10 11:48:31 + [2025-09-10 04:06:22] iteration 6983/ 11920 | consumed samples: 7150592 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.867843E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:20.350695 | finish at 2025-09-10 11:48:42 + [2025-09-10 04:06:27] iteration 6984/ 11920 | consumed samples: 7151616 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882262E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:23.266228 | finish at 2025-09-10 11:48:51 + [2025-09-10 04:06:33] iteration 6985/ 11920 | consumed samples: 7152640 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858834E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:36.026409 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:06:39] iteration 6986/ 11920 | consumed samples: 7153664 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854698E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:46.546414 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:06:45] iteration 6987/ 11920 | consumed samples: 7154688 | elapsed time per iteration (ms): 5932.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865139E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:46.367415 | finish at 2025-09-10 12:14:31 + [2025-09-10 04:06:50] iteration 6988/ 11920 | consumed samples: 7155712 | 
elapsed time per iteration (ms): 5919.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844218E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:06:35.887038 | finish at 2025-09-10 12:13:26 + [2025-09-10 04:06:56] iteration 6989/ 11920 | consumed samples: 7156736 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881120E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:56.296562 | finish at 2025-09-10 11:48:52 + [2025-09-10 04:07:02] iteration 6990/ 11920 | consumed samples: 7157760 | elapsed time per iteration (ms): 5827.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860190E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:46.919105 | finish at 2025-09-10 12:05:49 + [2025-09-10 04:07:07] iteration 6991/ 11920 | consumed samples: 7158784 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862215E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:38.902919 | finish at 2025-09-10 11:48:46 + [2025-09-10 04:07:13] iteration 6992/ 11920 | consumed samples: 7159808 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873699E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:41:52.675507 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:07:19] iteration 6993/ 11920 | consumed samples: 7160832 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866234E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:38.439178 | finish at 2025-09-10 11:48:57 + [2025-09-10 04:07:24] iteration 6994/ 11920 | consumed samples: 7161856 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845300E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:31.169659 | finish at 2025-09-10 11:48:56 + [2025-09-10 04:07:30] iteration 6995/ 11920 | consumed samples: 7162880 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864605E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:38.720533 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:07:36] iteration 6996/ 11920 | consumed samples: 7163904 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857925E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:25.989784 | finish at 2025-09-10 11:50:02 + [2025-09-10 04:07:41] iteration 6997/ 11920 | consumed samples: 7164928 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.865515E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:16.026203 | finish at 2025-09-10 11:49:57 + [2025-09-10 04:07:47] iteration 6998/ 11920 | consumed samples: 7165952 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850606E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:43.130744 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:07:53] iteration 6999/ 11920 | consumed samples: 7166976 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862499E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:28.590247 | finish at 2025-09-10 11:49:21 + [2025-09-10 04:07:58] iteration 7000/ 11920 | consumed samples: 7168000 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856639E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:32.333765 | finish at 2025-09-10 11:48:30 + [2025-09-10 04:08:04] iteration 7001/ 11920 | consumed samples: 7169024 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856175E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:47.021270 | finish at 2025-09-10 11:49:51 + [2025-09-10 04:08:09] 
iteration 7002/ 11920 | consumed samples: 7170048 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853726E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:49.970456 | finish at 2025-09-10 11:49:59 + [2025-09-10 04:08:15] iteration 7003/ 11920 | consumed samples: 7171072 | elapsed time per iteration (ms): 5837.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864465E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:58:24.813246 | finish at 2025-09-10 12:06:40 + [2025-09-10 04:08:21] iteration 7004/ 11920 | consumed samples: 7172096 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871145E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:22.649047 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:08:27] iteration 7005/ 11920 | consumed samples: 7173120 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849524E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:46.339377 | finish at 2025-09-10 11:50:13 + [2025-09-10 04:08:32] iteration 7006/ 11920 | consumed samples: 7174144 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857411E+00 | loss scale: 1.0 | grad norm: 0.193 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:37.148849 | finish at 2025-09-10 11:50:09 + [2025-09-10 04:08:38] iteration 7007/ 11920 | consumed samples: 7175168 | elapsed time per iteration (ms): 6329.0 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864315E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:38:14.302015 | finish at 2025-09-10 12:46:53 + [2025-09-10 04:08:44] iteration 7008/ 11920 | consumed samples: 7176192 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855480E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:05:49.681854 | finish at 2025-09-10 12:14:34 + [2025-09-10 04:08:50] iteration 7009/ 11920 | consumed samples: 7177216 | elapsed time per iteration (ms): 5982.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854780E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:09:38.943203 | finish at 2025-09-10 12:18:29 + [2025-09-10 04:08:56] iteration 7010/ 11920 | consumed samples: 7178240 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859734E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:52.562890 | finish at 2025-09-10 11:48:49 + [2025-09-10 04:09:02] iteration 7011/ 11920 | consumed samples: 7179264 | elapsed time per iteration (ms): 5620.5 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850582E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:50.852349 | finish at 2025-09-10 11:48:52 + [2025-09-10 04:09:07] iteration 7012/ 11920 | consumed samples: 7180288 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860660E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:07.099806 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:09:13] iteration 7013/ 11920 | consumed samples: 7181312 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861388E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:05.933452 | finish at 2025-09-10 11:49:19 + [2025-09-10 04:09:19] iteration 7014/ 11920 | consumed samples: 7182336 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866682E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:45.621104 | finish at 2025-09-10 11:49:04 + [2025-09-10 04:09:24] iteration 7015/ 11920 | consumed samples: 7183360 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870947E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:08.368961 
| finish at 2025-09-10 11:49:33 + [2025-09-10 04:09:30] iteration 7016/ 11920 | consumed samples: 7184384 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861404E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:02.001406 | finish at 2025-09-10 11:49:32 + [2025-09-10 04:09:35] iteration 7017/ 11920 | consumed samples: 7185408 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861507E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:58.319268 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:09:41] iteration 7018/ 11920 | consumed samples: 7186432 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863644E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:54.386228 | finish at 2025-09-10 11:49:35 + [2025-09-10 04:09:47] iteration 7019/ 11920 | consumed samples: 7187456 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866120E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:17.380739 | finish at 2025-09-10 11:49:04 + [2025-09-10 04:09:52] iteration 7020/ 11920 | consumed samples: 7188480 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.853459E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:12.816367 | finish at 2025-09-10 11:49:05 + [2025-09-10 04:09:58] iteration 7021/ 11920 | consumed samples: 7189504 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855515E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:18.598987 | finish at 2025-09-10 11:49:16 + [2025-09-10 04:10:04] iteration 7022/ 11920 | consumed samples: 7190528 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866448E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:24.219300 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:10:09] iteration 7023/ 11920 | consumed samples: 7191552 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867669E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:41.983568 | finish at 2025-09-10 11:48:51 + [2025-09-10 04:10:15] iteration 7024/ 11920 | consumed samples: 7192576 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870818E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:44.045380 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:10:20] iteration 7025/ 11920 | consumed samples: 7193600 | 
elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847701E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:38.728241 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:10:26] iteration 7026/ 11920 | consumed samples: 7194624 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861437E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:01.508016 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:10:32] iteration 7027/ 11920 | consumed samples: 7195648 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863646E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:09.316284 | finish at 2025-09-10 11:48:41 + [2025-09-10 04:10:37] iteration 7028/ 11920 | consumed samples: 7196672 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833733E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:09.920856 | finish at 2025-09-10 11:49:47 + [2025-09-10 04:10:43] iteration 7029/ 11920 | consumed samples: 7197696 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846758E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 9.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:38:38.486817 | finish at 2025-09-10 11:49:21 + [2025-09-10 04:10:49] iteration 7030/ 11920 | consumed samples: 7198720 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859261E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:17.650566 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:10:54] iteration 7031/ 11920 | consumed samples: 7199744 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853816E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:59.274184 | finish at 2025-09-10 11:48:53 + [2025-09-10 04:11:00] iteration 7032/ 11920 | consumed samples: 7200768 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852842E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:34.522509 | finish at 2025-09-10 11:48:34 + [2025-09-10 04:11:05] iteration 7033/ 11920 | consumed samples: 7201792 | elapsed time per iteration (ms): 5614.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871436E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:15.548492 | finish at 2025-09-10 11:48:21 + [2025-09-10 04:11:11] iteration 7034/ 11920 | consumed samples: 7202816 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.862233E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:39.776087 | finish at 2025-09-10 11:48:51 + [2025-09-10 04:11:17] iteration 7035/ 11920 | consumed samples: 7203840 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842899E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:31.626320 | finish at 2025-09-10 11:48:48 + [2025-09-10 04:11:22] iteration 7036/ 11920 | consumed samples: 7204864 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854685E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:17.414369 | finish at 2025-09-10 11:48:40 + [2025-09-10 04:11:28] iteration 7037/ 11920 | consumed samples: 7205888 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862393E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:16.648198 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:11:33] iteration 7038/ 11920 | consumed samples: 7206912 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834282E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:22.613629 | finish at 2025-09-10 11:49:56 + [2025-09-10 04:11:39] 
iteration 7039/ 11920 | consumed samples: 7207936 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855208E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:50.893021 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:11:45] iteration 7040/ 11920 | consumed samples: 7208960 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862995E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:02.958336 | finish at 2025-09-10 11:48:48 + [2025-09-10 04:11:50] iteration 7041/ 11920 | consumed samples: 7209984 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853912E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:07.923237 | finish at 2025-09-10 11:48:58 + [2025-09-10 04:11:56] iteration 7042/ 11920 | consumed samples: 7211008 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861112E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:55.852741 | finish at 2025-09-10 11:48:52 + [2025-09-10 04:12:02] iteration 7043/ 11920 | consumed samples: 7212032 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853354E+00 | loss scale: 1.0 | grad norm: 0.251 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:00.289211 | finish at 2025-09-10 11:49:02 + [2025-09-10 04:12:07] iteration 7044/ 11920 | consumed samples: 7213056 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861413E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:36.960810 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:12:13] iteration 7045/ 11920 | consumed samples: 7214080 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855178E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:31.025484 | finish at 2025-09-10 11:48:44 + [2025-09-10 04:12:18] iteration 7046/ 11920 | consumed samples: 7215104 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871477E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:31.315847 | finish at 2025-09-10 11:48:50 + [2025-09-10 04:12:24] iteration 7047/ 11920 | consumed samples: 7216128 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850322E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:34.746492 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:12:30] iteration 7048/ 11920 | consumed samples: 7217152 | elapsed time per iteration (ms): 5629.9 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862438E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:08.793709 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:12:35] iteration 7049/ 11920 | consumed samples: 7218176 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868902E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:45.624154 | finish at 2025-09-10 11:49:21 + [2025-09-10 04:12:41] iteration 7050/ 11920 | consumed samples: 7219200 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853505E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:04.398355 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:12:47] iteration 7051/ 11920 | consumed samples: 7220224 | elapsed time per iteration (ms): 5958.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857103E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:03:31.573866 | finish at 2025-09-10 12:16:19 + [2025-09-10 04:12:53] iteration 7052/ 11920 | consumed samples: 7221248 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842532E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:20.890218 
| finish at 2025-09-10 11:49:13 + [2025-09-10 04:12:58] iteration 7053/ 11920 | consumed samples: 7222272 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861987E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:04.878958 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:13:04] iteration 7054/ 11920 | consumed samples: 7223296 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843232E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:51.734044 | finish at 2025-09-10 11:48:56 + [2025-09-10 04:13:09] iteration 7055/ 11920 | consumed samples: 7224320 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860214E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:21.240022 | finish at 2025-09-10 11:48:31 + [2025-09-10 04:13:15] iteration 7056/ 11920 | consumed samples: 7225344 | elapsed time per iteration (ms): 5828.8 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852063E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:31.250549 | finish at 2025-09-10 12:05:47 + [2025-09-10 04:13:21] iteration 7057/ 11920 | consumed samples: 7226368 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.870459E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:27.800875 | finish at 2025-09-10 11:48:49 + [2025-09-10 04:13:26] iteration 7058/ 11920 | consumed samples: 7227392 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860773E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:32.521324 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:13:32] iteration 7059/ 11920 | consumed samples: 7228416 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850899E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:27.392217 | finish at 2025-09-10 11:49:00 + [2025-09-10 04:13:38] iteration 7060/ 11920 | consumed samples: 7229440 | elapsed time per iteration (ms): 6023.6 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858386E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:07:54.749794 | finish at 2025-09-10 12:21:33 + [2025-09-10 04:13:44] iteration 7061/ 11920 | consumed samples: 7230464 | elapsed time per iteration (ms): 5916.2 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857210E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:06.686833 | finish at 2025-09-10 12:12:51 + [2025-09-10 04:13:50] iteration 7062/ 11920 | consumed samples: 7231488 | 
elapsed time per iteration (ms): 5615.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849845E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:38.741416 | finish at 2025-09-10 11:48:28 + [2025-09-10 04:13:55] iteration 7063/ 11920 | consumed samples: 7232512 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863513E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:57.088670 | finish at 2025-09-10 11:48:52 + [2025-09-10 04:14:01] iteration 7064/ 11920 | consumed samples: 7233536 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854046E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:47.005135 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:14:07] iteration 7065/ 11920 | consumed samples: 7234560 | elapsed time per iteration (ms): 5933.9 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867403E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:09.079285 | finish at 2025-09-10 12:14:16 + [2025-09-10 04:14:12] iteration 7066/ 11920 | consumed samples: 7235584 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840971E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:34:47.618621 | finish at 2025-09-10 11:49:00 + [2025-09-10 04:14:18] iteration 7067/ 11920 | consumed samples: 7236608 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844901E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:39.310285 | finish at 2025-09-10 11:48:57 + [2025-09-10 04:14:24] iteration 7068/ 11920 | consumed samples: 7237632 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851183E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:06.477698 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:14:30] iteration 7069/ 11920 | consumed samples: 7238656 | elapsed time per iteration (ms): 5938.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869493E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:00:06.963424 | finish at 2025-09-10 12:14:37 + [2025-09-10 04:14:36] iteration 7070/ 11920 | consumed samples: 7239680 | elapsed time per iteration (ms): 6332.0 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863837E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:31:50.425568 | finish at 2025-09-10 12:46:26 + [2025-09-10 04:14:42] iteration 7071/ 11920 | consumed samples: 7240704 | elapsed time per iteration (ms): 6093.4 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.862242E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:12:26.808744 | finish at 2025-09-10 12:27:09 + [2025-09-10 04:14:48] iteration 7072/ 11920 | consumed samples: 7241728 | elapsed time per iteration (ms): 5879.0 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848748E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:01.407932 | finish at 2025-09-10 12:09:49 + [2025-09-10 04:14:54] iteration 7073/ 11920 | consumed samples: 7242752 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852603E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:05.793869 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:14:59] iteration 7074/ 11920 | consumed samples: 7243776 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873571E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:26.698990 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:15:05] iteration 7075/ 11920 | consumed samples: 7244800 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849032E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:56.645801 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:15:11] 
iteration 7076/ 11920 | consumed samples: 7245824 | elapsed time per iteration (ms): 5846.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860883E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:51:57.843623 | finish at 2025-09-10 12:07:09 + [2025-09-10 04:15:16] iteration 7077/ 11920 | consumed samples: 7246848 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851374E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:46.467201 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:15:22] iteration 7078/ 11920 | consumed samples: 7247872 | elapsed time per iteration (ms): 5946.4 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857551E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:52.656618 | finish at 2025-09-10 12:15:15 + [2025-09-10 04:15:28] iteration 7079/ 11920 | consumed samples: 7248896 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869205E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:06.182251 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:15:34] iteration 7080/ 11920 | consumed samples: 7249920 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861699E+00 | loss scale: 1.0 | grad norm: 0.199 | num 
zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:51.153994 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:15:39] iteration 7081/ 11920 | consumed samples: 7250944 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848564E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:56.245665 | finish at 2025-09-10 11:49:35 + [2025-09-10 04:15:45] iteration 7082/ 11920 | consumed samples: 7251968 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853494E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:28.050709 | finish at 2025-09-10 11:49:13 + [2025-09-10 04:15:50] iteration 7083/ 11920 | consumed samples: 7252992 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854486E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:12.664790 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:15:56] iteration 7084/ 11920 | consumed samples: 7254016 | elapsed time per iteration (ms): 5896.9 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870326E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:55:17.502648 | finish at 2025-09-10 12:11:14 + [2025-09-10 04:16:02] iteration 7085/ 11920 | consumed samples: 7255040 | elapsed time per iteration (ms): 5969.4 | throughput 
per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852062E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:01:02.060184 | finish at 2025-09-10 12:17:04 + [2025-09-10 04:16:08] iteration 7086/ 11920 | consumed samples: 7256064 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852105E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:07.793606 | finish at 2025-09-10 11:49:16 + [2025-09-10 04:16:14] iteration 7087/ 11920 | consumed samples: 7257088 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868605E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:07.408725 | finish at 2025-09-10 11:49:21 + [2025-09-10 04:16:19] iteration 7088/ 11920 | consumed samples: 7258112 | elapsed time per iteration (ms): 5810.8 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857489E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:58.014069 | finish at 2025-09-10 12:04:17 + [2025-09-10 04:16:25] iteration 7089/ 11920 | consumed samples: 7259136 | elapsed time per iteration (ms): 5926.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843285E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:10.874672 
| finish at 2025-09-10 12:13:36 + [2025-09-10 04:16:31] iteration 7090/ 11920 | consumed samples: 7260160 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867687E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:23.131206 | finish at 2025-09-10 11:48:54 + [2025-09-10 04:16:37] iteration 7091/ 11920 | consumed samples: 7261184 | elapsed time per iteration (ms): 6153.8 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846530E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:15:16.636929 | finish at 2025-09-10 12:31:54 + [2025-09-10 04:16:43] iteration 7092/ 11920 | consumed samples: 7262208 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863333E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:26.347138 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:16:48] iteration 7093/ 11920 | consumed samples: 7263232 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865376E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:12.311761 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:16:54] iteration 7094/ 11920 | consumed samples: 7264256 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.858870E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:09.260121 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:16:59] iteration 7095/ 11920 | consumed samples: 7265280 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842585E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:06.499611 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:17:05] iteration 7096/ 11920 | consumed samples: 7266304 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862592E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:09.408062 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:17:11] iteration 7097/ 11920 | consumed samples: 7267328 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855972E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:43.680645 | finish at 2025-09-10 11:48:54 + [2025-09-10 04:17:16] iteration 7098/ 11920 | consumed samples: 7268352 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853426E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:37.506839 | finish at 2025-09-10 11:48:54 + [2025-09-10 04:17:22] iteration 7099/ 11920 | consumed samples: 7269376 | 
elapsed time per iteration (ms): 5942.3 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855719E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:27.874668 | finish at 2025-09-10 12:14:50 + [2025-09-10 04:17:28] iteration 7100/ 11920 | consumed samples: 7270400 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851311E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:57.335739 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:17:34] iteration 7101/ 11920 | consumed samples: 7271424 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867063E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:13.023706 | finish at 2025-09-10 11:49:47 + [2025-09-10 04:17:39] iteration 7102/ 11920 | consumed samples: 7272448 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861137E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:32:05.655295 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:17:45] iteration 7103/ 11920 | consumed samples: 7273472 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849271E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:32:30.244718 | finish at 2025-09-10 11:50:15 + [2025-09-10 04:17:50] iteration 7104/ 11920 | consumed samples: 7274496 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870290E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:12.681339 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:17:56] iteration 7105/ 11920 | consumed samples: 7275520 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866842E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:59.771376 | finish at 2025-09-10 11:48:56 + [2025-09-10 04:18:02] iteration 7106/ 11920 | consumed samples: 7276544 | elapsed time per iteration (ms): 5834.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869530E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:07.607023 | finish at 2025-09-10 12:06:10 + [2025-09-10 04:18:08] iteration 7107/ 11920 | consumed samples: 7277568 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864550E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:45.465452 | finish at 2025-09-10 11:48:53 + [2025-09-10 04:18:13] iteration 7108/ 11920 | consumed samples: 7278592 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.862589E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:29.359002 | finish at 2025-09-10 11:48:43 + [2025-09-10 04:18:19] iteration 7109/ 11920 | consumed samples: 7279616 | elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857692E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:11.913737 | finish at 2025-09-10 11:48:31 + [2025-09-10 04:18:24] iteration 7110/ 11920 | consumed samples: 7280640 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845077E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:24.025106 | finish at 2025-09-10 11:48:48 + [2025-09-10 04:18:30] iteration 7111/ 11920 | consumed samples: 7281664 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862452E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:08.708465 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:18:36] iteration 7112/ 11920 | consumed samples: 7282688 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851837E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:00.391592 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:18:41] 
iteration 7113/ 11920 | consumed samples: 7283712 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864814E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:06.294083 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:18:47] iteration 7114/ 11920 | consumed samples: 7284736 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861707E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:58.478763 | finish at 2025-09-10 11:48:45 + [2025-09-10 04:18:53] iteration 7115/ 11920 | consumed samples: 7285760 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846502E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:02.295128 | finish at 2025-09-10 11:48:55 + [2025-09-10 04:18:58] iteration 7116/ 11920 | consumed samples: 7286784 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859111E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:06.792494 | finish at 2025-09-10 11:49:05 + [2025-09-10 04:19:04] iteration 7117/ 11920 | consumed samples: 7287808 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861011E+00 | loss scale: 1.0 | grad norm: 0.151 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:37.803354 | finish at 2025-09-10 11:48:42 + [2025-09-10 04:19:10] iteration 7118/ 11920 | consumed samples: 7288832 | elapsed time per iteration (ms): 5992.3 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848671E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:59:34.897679 | finish at 2025-09-10 12:18:45 + [2025-09-10 04:19:15] iteration 7119/ 11920 | consumed samples: 7289856 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867135E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:41.889587 | finish at 2025-09-10 11:48:57 + [2025-09-10 04:19:22] iteration 7120/ 11920 | consumed samples: 7290880 | elapsed time per iteration (ms): 6364.1 | throughput per GPU (TFLOP/s/GPU): 70.9 | MFU 7.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845674E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:29:07.451019 | finish at 2025-09-10 12:48:29 + [2025-09-10 04:19:28] iteration 7121/ 11920 | consumed samples: 7291904 | elapsed time per iteration (ms): 5828.3 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857247E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:46:09.953344 | finish at 2025-09-10 12:05:37 + [2025-09-10 04:19:33] iteration 7122/ 11920 | consumed samples: 7292928 | elapsed time per iteration (ms): 5621.5 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858843E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:31.857551 | finish at 2025-09-10 11:49:05 + [2025-09-10 04:19:39] iteration 7123/ 11920 | consumed samples: 7293952 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843081E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:23.431061 | finish at 2025-09-10 11:50:02 + [2025-09-10 04:19:45] iteration 7124/ 11920 | consumed samples: 7294976 | elapsed time per iteration (ms): 6139.3 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860829E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:10:44.214815 | finish at 2025-09-10 12:30:29 + [2025-09-10 04:19:51] iteration 7125/ 11920 | consumed samples: 7296000 | elapsed time per iteration (ms): 5874.9 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861923E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:30.272889 | finish at 2025-09-10 12:09:21 + [2025-09-10 04:19:56] iteration 7126/ 11920 | consumed samples: 7297024 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858649E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:09.561368 
| finish at 2025-09-10 11:49:06 + [2025-09-10 04:20:02] iteration 7127/ 11920 | consumed samples: 7298048 | elapsed time per iteration (ms): 5980.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860221E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:45.754579 | finish at 2025-09-10 12:17:48 + [2025-09-10 04:20:08] iteration 7128/ 11920 | consumed samples: 7299072 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860883E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:54.362989 | finish at 2025-09-10 11:49:02 + [2025-09-10 04:20:14] iteration 7129/ 11920 | consumed samples: 7300096 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857869E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:47.074591 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:20:19] iteration 7130/ 11920 | consumed samples: 7301120 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869519E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:55.903146 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:20:25] iteration 7131/ 11920 | consumed samples: 7302144 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.856522E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:42.979201 | finish at 2025-09-10 11:49:08 + [2025-09-10 04:20:31] iteration 7132/ 11920 | consumed samples: 7303168 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860117E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:05.038765 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:20:36] iteration 7133/ 11920 | consumed samples: 7304192 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859632E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:41.245649 | finish at 2025-09-10 11:51:17 + [2025-09-10 04:20:42] iteration 7134/ 11920 | consumed samples: 7305216 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855938E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:11.495236 | finish at 2025-09-10 11:49:53 + [2025-09-10 04:20:47] iteration 7135/ 11920 | consumed samples: 7306240 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861106E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:13.451772 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:20:53] iteration 7136/ 11920 | consumed samples: 7307264 | 
elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855623E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:34.750576 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:20:59] iteration 7137/ 11920 | consumed samples: 7308288 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857097E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:55.242324 | finish at 2025-09-10 11:48:54 + [2025-09-10 04:21:04] iteration 7138/ 11920 | consumed samples: 7309312 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844794E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:05.479031 | finish at 2025-09-10 11:49:10 + [2025-09-10 04:21:10] iteration 7139/ 11920 | consumed samples: 7310336 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875557E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:03.939853 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:21:16] iteration 7140/ 11920 | consumed samples: 7311360 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840645E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:28:19.177899 | finish at 2025-09-10 11:49:35 + [2025-09-10 04:21:21] iteration 7141/ 11920 | consumed samples: 7312384 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845512E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:50.907113 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:21:27] iteration 7142/ 11920 | consumed samples: 7313408 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867005E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:12.753068 | finish at 2025-09-10 11:49:40 + [2025-09-10 04:21:32] iteration 7143/ 11920 | consumed samples: 7314432 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857718E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:46.606871 | finish at 2025-09-10 11:49:19 + [2025-09-10 04:21:38] iteration 7144/ 11920 | consumed samples: 7315456 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872697E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:28:07.416195 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:21:44] iteration 7145/ 11920 | consumed samples: 7316480 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.857029E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:59.416251 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:21:49] iteration 7146/ 11920 | consumed samples: 7317504 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854802E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:18.298795 | finish at 2025-09-10 11:49:08 + [2025-09-10 04:21:55] iteration 7147/ 11920 | consumed samples: 7318528 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858977E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:55.219405 | finish at 2025-09-10 11:48:50 + [2025-09-10 04:22:01] iteration 7148/ 11920 | consumed samples: 7319552 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864971E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:15.355034 | finish at 2025-09-10 11:49:16 + [2025-09-10 04:22:06] iteration 7149/ 11920 | consumed samples: 7320576 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860409E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:38.526081 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:22:12] 
iteration 7150/ 11920 | consumed samples: 7321600 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862614E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:58.613942 | finish at 2025-09-10 11:49:10 + [2025-09-10 04:22:18] iteration 7151/ 11920 | consumed samples: 7322624 | elapsed time per iteration (ms): 5952.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870481E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:08.007399 | finish at 2025-09-10 12:15:26 + [2025-09-10 04:22:23] iteration 7152/ 11920 | consumed samples: 7323648 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848241E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:46.824722 | finish at 2025-09-10 11:49:10 +(min, max) time across ranks (ms): + save-checkpoint ................................: (3942.20, 3942.38) + [2025-09-10 04:22:33] iteration 7153/ 11920 | consumed samples: 7324672 | elapsed time per iteration (ms): 5922.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858155E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:50:31.494334 | finish at 2025-09-10 12:13:05 + [2025-09-10 04:22:39] iteration 7154/ 11920 | consumed samples: 7325696 | elapsed time per iteration (ms): 5851.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.846354E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:45.875427 | finish at 2025-09-10 12:07:25 + [2025-09-10 04:22:45] iteration 7155/ 11920 | consumed samples: 7326720 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835436E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:37.224281 | finish at 2025-09-10 11:49:22 + [2025-09-10 04:22:50] iteration 7156/ 11920 | consumed samples: 7327744 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845359E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:47.260154 | finish at 2025-09-10 11:49:38 + [2025-09-10 04:22:56] iteration 7157/ 11920 | consumed samples: 7328768 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851269E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:04.213219 | finish at 2025-09-10 11:49:00 + [2025-09-10 04:23:02] iteration 7158/ 11920 | consumed samples: 7329792 | elapsed time per iteration (ms): 5918.4 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859396E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:43.416493 | finish at 2025-09-10 12:12:45 + [2025-09-10 04:23:08] 
iteration 7159/ 11920 | consumed samples: 7330816 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859072E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:57.207663 | finish at 2025-09-10 11:49:05 + [2025-09-10 04:23:13] iteration 7160/ 11920 | consumed samples: 7331840 | elapsed time per iteration (ms): 5915.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845402E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:18.754501 | finish at 2025-09-10 12:12:32 + [2025-09-10 04:23:19] iteration 7161/ 11920 | consumed samples: 7332864 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855648E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:43.075318 | finish at 2025-09-10 11:49:02 + [2025-09-10 04:23:25] iteration 7162/ 11920 | consumed samples: 7333888 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860479E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:49.540561 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:23:30] iteration 7163/ 11920 | consumed samples: 7334912 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862626E+00 | loss scale: 1.0 | grad norm: 0.225 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:23.991723 | finish at 2025-09-10 11:49:54 + [2025-09-10 04:23:36] iteration 7164/ 11920 | consumed samples: 7335936 | elapsed time per iteration (ms): 5868.9 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843829E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:12.410448 | finish at 2025-09-10 12:08:49 + [2025-09-10 04:23:42] iteration 7165/ 11920 | consumed samples: 7336960 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863488E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:42.802820 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:23:47] iteration 7166/ 11920 | consumed samples: 7337984 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862118E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:21.572315 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:23:53] iteration 7167/ 11920 | consumed samples: 7339008 | elapsed time per iteration (ms): 5963.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862089E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:24.932628 | finish at 2025-09-10 12:16:18 + [2025-09-10 04:23:59] iteration 7168/ 11920 | consumed samples: 7340032 | elapsed time per iteration (ms): 5838.2 | throughput 
per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854321E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:23.326241 | finish at 2025-09-10 12:06:23 + [2025-09-10 04:24:05] iteration 7169/ 11920 | consumed samples: 7341056 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861869E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:35.410023 | finish at 2025-09-10 11:49:40 + [2025-09-10 04:24:10] iteration 7170/ 11920 | consumed samples: 7342080 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853395E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:13.663995 | finish at 2025-09-10 11:49:24 + [2025-09-10 04:24:16] iteration 7171/ 11920 | consumed samples: 7343104 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843965E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:46.736969 | finish at 2025-09-10 11:50:03 + [2025-09-10 04:24:22] iteration 7172/ 11920 | consumed samples: 7344128 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854253E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:54.449041 
| finish at 2025-09-10 11:50:16 + [2025-09-10 04:24:27] iteration 7173/ 11920 | consumed samples: 7345152 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849684E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:11.594666 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:24:33] iteration 7174/ 11920 | consumed samples: 7346176 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854443E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:32.026860 | finish at 2025-09-10 11:50:05 + [2025-09-10 04:24:39] iteration 7175/ 11920 | consumed samples: 7347200 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856466E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:59.458715 | finish at 2025-09-10 11:50:38 + [2025-09-10 04:24:44] iteration 7176/ 11920 | consumed samples: 7348224 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863468E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:44.062704 | finish at 2025-09-10 11:50:28 + [2025-09-10 04:24:50] iteration 7177/ 11920 | consumed samples: 7349248 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.859353E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:24:28.304279 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:24:56] iteration 7178/ 11920 | consumed samples: 7350272 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855616E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:24:27.540850 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:25:01] iteration 7179/ 11920 | consumed samples: 7351296 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854883E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:59.444820 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:25:07] iteration 7180/ 11920 | consumed samples: 7352320 | elapsed time per iteration (ms): 5978.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860981E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:16.941290 | finish at 2025-09-10 12:17:24 + [2025-09-10 04:25:13] iteration 7181/ 11920 | consumed samples: 7353344 | elapsed time per iteration (ms): 5976.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858180E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:52:03.356777 | finish at 2025-09-10 12:17:16 + [2025-09-10 04:25:19] iteration 7182/ 11920 | consumed samples: 7354368 | 
elapsed time per iteration (ms): 6051.1 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861381E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:57:50.219189 | finish at 2025-09-10 12:23:09 + [2025-09-10 04:25:25] iteration 7183/ 11920 | consumed samples: 7355392 | elapsed time per iteration (ms): 6000.9 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859357E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:53:46.240855 | finish at 2025-09-10 12:19:11 + [2025-09-10 04:25:31] iteration 7184/ 11920 | consumed samples: 7356416 | elapsed time per iteration (ms): 5854.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852200E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:06.471893 | finish at 2025-09-10 12:07:37 + [2025-09-10 04:25:37] iteration 7185/ 11920 | consumed samples: 7357440 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845503E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:24:29.323857 | finish at 2025-09-10 11:50:06 + [2025-09-10 04:25:42] iteration 7186/ 11920 | consumed samples: 7358464 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859538E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:24:35.777312 | finish at 2025-09-10 11:50:18 + [2025-09-10 04:25:48] iteration 7187/ 11920 | consumed samples: 7359488 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858847E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:14.945718 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:25:54] iteration 7188/ 11920 | consumed samples: 7360512 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852634E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:14.963142 | finish at 2025-09-10 11:49:08 + [2025-09-10 04:25:59] iteration 7189/ 11920 | consumed samples: 7361536 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857519E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:12.521492 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:26:05] iteration 7190/ 11920 | consumed samples: 7362560 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852013E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:57.947614 | finish at 2025-09-10 11:49:03 + [2025-09-10 04:26:10] iteration 7191/ 11920 | consumed samples: 7363584 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.845947E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:08.116721 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:26:16] iteration 7192/ 11920 | consumed samples: 7364608 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855689E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:55.030890 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:26:22] iteration 7193/ 11920 | consumed samples: 7365632 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858720E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:52.735904 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:26:27] iteration 7194/ 11920 | consumed samples: 7366656 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847494E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:56.982642 | finish at 2025-09-10 11:49:24 + [2025-09-10 04:26:33] iteration 7195/ 11920 | consumed samples: 7367680 | elapsed time per iteration (ms): 5896.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856421E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:19.872544 | finish at 2025-09-10 12:10:53 + [2025-09-10 04:26:39] 
iteration 7196/ 11920 | consumed samples: 7368704 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852266E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:17.252461 | finish at 2025-09-10 11:49:56 + [2025-09-10 04:26:45] iteration 7197/ 11920 | consumed samples: 7369728 | elapsed time per iteration (ms): 5968.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869119E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:47.381677 | finish at 2025-09-10 12:16:32 + [2025-09-10 04:26:50] iteration 7198/ 11920 | consumed samples: 7370752 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868613E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:43.909169 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:26:56] iteration 7199/ 11920 | consumed samples: 7371776 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854394E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:09.399123 | finish at 2025-09-10 11:49:05 + [2025-09-10 04:27:02] iteration 7200/ 11920 | consumed samples: 7372800 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861213E+00 | loss scale: 1.0 | grad norm: 0.229 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:59.109535 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:27:07] iteration 7201/ 11920 | consumed samples: 7373824 | elapsed time per iteration (ms): 5613.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863696E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:32.099605 | finish at 2025-09-10 11:48:39 + [2025-09-10 04:27:13] iteration 7202/ 11920 | consumed samples: 7374848 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874763E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:46.286573 | finish at 2025-09-10 11:48:59 + [2025-09-10 04:27:18] iteration 7203/ 11920 | consumed samples: 7375872 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863097E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:42.052862 | finish at 2025-09-10 11:49:00 + [2025-09-10 04:27:24] iteration 7204/ 11920 | consumed samples: 7376896 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858440E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:53.733067 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:27:30] iteration 7205/ 11920 | consumed samples: 7377920 | elapsed time per iteration (ms): 5629.4 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865834E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:22.618822 | finish at 2025-09-10 11:49:52 + [2025-09-10 04:27:35] iteration 7206/ 11920 | consumed samples: 7378944 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850508E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:24.478003 | finish at 2025-09-10 11:50:00 + [2025-09-10 04:27:41] iteration 7207/ 11920 | consumed samples: 7379968 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856243E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:08.812670 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:27:47] iteration 7208/ 11920 | consumed samples: 7380992 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859435E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:38.007782 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:27:52] iteration 7209/ 11920 | consumed samples: 7382016 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864563E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:23.926646 
| finish at 2025-09-10 11:49:16 + [2025-09-10 04:27:58] iteration 7210/ 11920 | consumed samples: 7383040 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863748E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:13.511045 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:28:03] iteration 7211/ 11920 | consumed samples: 7384064 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854236E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:26.636282 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:28:09] iteration 7212/ 11920 | consumed samples: 7385088 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857396E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:51.461330 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:28:15] iteration 7213/ 11920 | consumed samples: 7386112 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869124E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:28.503612 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:28:20] iteration 7214/ 11920 | consumed samples: 7387136 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.867414E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:23.096053 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:28:26] iteration 7215/ 11920 | consumed samples: 7388160 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868722E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:54.977260 | finish at 2025-09-10 11:49:21 + [2025-09-10 04:28:32] iteration 7216/ 11920 | consumed samples: 7389184 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843563E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:16.415703 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:28:37] iteration 7217/ 11920 | consumed samples: 7390208 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849151E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:05.612325 | finish at 2025-09-10 11:50:43 + [2025-09-10 04:28:43] iteration 7218/ 11920 | consumed samples: 7391232 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858762E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:07.668741 | finish at 2025-09-10 11:49:51 + [2025-09-10 04:28:48] iteration 7219/ 11920 | consumed samples: 7392256 | 
elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858099E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:38.994831 | finish at 2025-09-10 11:49:27 + [2025-09-10 04:28:54] iteration 7220/ 11920 | consumed samples: 7393280 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851948E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:34.562993 | finish at 2025-09-10 11:49:29 + [2025-09-10 04:29:00] iteration 7221/ 11920 | consumed samples: 7394304 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872189E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:06.817724 | finish at 2025-09-10 11:49:07 + [2025-09-10 04:29:05] iteration 7222/ 11920 | consumed samples: 7395328 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869569E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:09.081253 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:29:11] iteration 7223/ 11920 | consumed samples: 7396352 | elapsed time per iteration (ms): 5947.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858301E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 8.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:45:36.813989 | finish at 2025-09-10 12:14:48 + [2025-09-10 04:29:17] iteration 7224/ 11920 | consumed samples: 7397376 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847363E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:54.450611 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:29:23] iteration 7225/ 11920 | consumed samples: 7398400 | elapsed time per iteration (ms): 5996.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854605E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:49:12.289245 | finish at 2025-09-10 12:18:35 + [2025-09-10 04:29:29] iteration 7226/ 11920 | consumed samples: 7399424 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846070E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:20.804526 | finish at 2025-09-10 11:49:49 + [2025-09-10 04:29:34] iteration 7227/ 11920 | consumed samples: 7400448 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851044E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:09.565736 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:29:40] iteration 7228/ 11920 | consumed samples: 7401472 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.858407E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:39.620195 | finish at 2025-09-10 11:50:19 + [2025-09-10 04:29:45] iteration 7229/ 11920 | consumed samples: 7402496 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864549E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:24.041306 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:29:51] iteration 7230/ 11920 | consumed samples: 7403520 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861791E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:27.089329 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:29:57] iteration 7231/ 11920 | consumed samples: 7404544 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855365E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:42.053182 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:30:02] iteration 7232/ 11920 | consumed samples: 7405568 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860215E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:26.320511 | finish at 2025-09-10 11:49:29 + [2025-09-10 04:30:08] 
iteration 7233/ 11920 | consumed samples: 7406592 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847508E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:12.251591 | finish at 2025-09-10 11:49:20 + [2025-09-10 04:30:14] iteration 7234/ 11920 | consumed samples: 7407616 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857642E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:43.046389 | finish at 2025-09-10 11:49:57 + [2025-09-10 04:30:19] iteration 7235/ 11920 | consumed samples: 7408640 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857382E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 14.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:55.296997 | finish at 2025-09-10 11:50:14 + [2025-09-10 04:30:25] iteration 7236/ 11920 | consumed samples: 7409664 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865700E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:35.776176 | finish at 2025-09-10 11:50:01 + [2025-09-10 04:30:31] iteration 7237/ 11920 | consumed samples: 7410688 | elapsed time per iteration (ms): 5876.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864653E+00 | loss scale: 1.0 | grad norm: 0.202 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:37.724120 | finish at 2025-09-10 12:09:08 + [2025-09-10 04:30:36] iteration 7238/ 11920 | consumed samples: 7411712 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857669E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:54.007226 | finish at 2025-09-10 11:50:30 + [2025-09-10 04:30:42] iteration 7239/ 11920 | consumed samples: 7412736 | elapsed time per iteration (ms): 5896.5 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865602E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:01.523714 | finish at 2025-09-10 12:10:44 + [2025-09-10 04:30:48] iteration 7240/ 11920 | consumed samples: 7413760 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857002E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:04.089088 | finish at 2025-09-10 11:49:52 + [2025-09-10 04:30:53] iteration 7241/ 11920 | consumed samples: 7414784 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857682E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:18:37.935927 | finish at 2025-09-10 11:49:31 + [2025-09-10 04:30:59] iteration 7242/ 11920 | consumed samples: 7415808 | elapsed time per iteration (ms): 5958.7 | throughput 
per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863813E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:34.917666 | finish at 2025-09-10 12:15:34 + [2025-09-10 04:31:05] iteration 7243/ 11920 | consumed samples: 7416832 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852489E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:18:24.039333 | finish at 2025-09-10 11:49:29 + [2025-09-10 04:31:11] iteration 7244/ 11920 | consumed samples: 7417856 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851029E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:55.835131 | finish at 2025-09-10 11:49:07 + [2025-09-10 04:31:16] iteration 7245/ 11920 | consumed samples: 7418880 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853757E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:52.138530 | finish at 2025-09-10 11:49:08 + [2025-09-10 04:31:22] iteration 7246/ 11920 | consumed samples: 7419904 | elapsed time per iteration (ms): 5839.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859881E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:34:53.638897 
| finish at 2025-09-10 12:06:16 + [2025-09-10 04:31:28] iteration 7247/ 11920 | consumed samples: 7420928 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859409E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:49.113593 | finish at 2025-09-10 11:49:17 + [2025-09-10 04:31:33] iteration 7248/ 11920 | consumed samples: 7421952 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844311E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:49.386841 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:31:39] iteration 7249/ 11920 | consumed samples: 7422976 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851424E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:18:14.840605 | finish at 2025-09-10 11:49:54 + [2025-09-10 04:31:45] iteration 7250/ 11920 | consumed samples: 7424000 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859298E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:18:03.314579 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:31:50] iteration 7251/ 11920 | consumed samples: 7425024 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.860566E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:02.645455 | finish at 2025-09-10 11:48:53 + [2025-09-10 04:31:56] iteration 7252/ 11920 | consumed samples: 7426048 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866807E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:05.221461 | finish at 2025-09-10 11:49:01 + [2025-09-10 04:32:01] iteration 7253/ 11920 | consumed samples: 7427072 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850473E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:23.899170 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:32:07] iteration 7254/ 11920 | consumed samples: 7428096 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849990E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:10.111527 | finish at 2025-09-10 11:49:17 + [2025-09-10 04:32:13] iteration 7255/ 11920 | consumed samples: 7429120 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859367E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:16:54.580082 | finish at 2025-09-10 11:49:07 + [2025-09-10 04:32:18] iteration 7256/ 11920 | consumed samples: 7430144 | 
elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841870E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:16:39.733419 | finish at 2025-09-10 11:48:58 + [2025-09-10 04:32:24] iteration 7257/ 11920 | consumed samples: 7431168 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855004E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:05.947488 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:32:30] iteration 7258/ 11920 | consumed samples: 7432192 | elapsed time per iteration (ms): 5867.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867611E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:52.385745 | finish at 2025-09-10 12:08:22 + [2025-09-10 04:32:35] iteration 7259/ 11920 | consumed samples: 7433216 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854098E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:01.080976 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:32:41] iteration 7260/ 11920 | consumed samples: 7434240 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860505E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:17:07.000060 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:32:47] iteration 7261/ 11920 | consumed samples: 7435264 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837945E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:16:48.069102 | finish at 2025-09-10 11:49:35 + [2025-09-10 04:32:53] iteration 7262/ 11920 | consumed samples: 7436288 | elapsed time per iteration (ms): 5979.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838941E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:44:13.860580 | finish at 2025-09-10 12:17:07 + [2025-09-10 04:32:59] iteration 7263/ 11920 | consumed samples: 7437312 | elapsed time per iteration (ms): 5890.4 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840695E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:11.581384 | finish at 2025-09-10 12:10:10 + [2025-09-10 04:33:04] iteration 7264/ 11920 | consumed samples: 7438336 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866444E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:16:21.645561 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:33:10] iteration 7265/ 11920 | consumed samples: 7439360 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.856144E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:43.906959 | finish at 2025-09-10 11:48:54 + [2025-09-10 04:33:15] iteration 7266/ 11920 | consumed samples: 7440384 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846051E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:49.730629 | finish at 2025-09-10 11:49:05 + [2025-09-10 04:33:21] iteration 7267/ 11920 | consumed samples: 7441408 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852852E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:59.538648 | finish at 2025-09-10 11:49:21 + [2025-09-10 04:33:27] iteration 7268/ 11920 | consumed samples: 7442432 | elapsed time per iteration (ms): 5869.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855407E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:04.481078 | finish at 2025-09-10 12:08:31 + [2025-09-10 04:33:33] iteration 7269/ 11920 | consumed samples: 7443456 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851082E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:59.209241 | finish at 2025-09-10 11:49:32 + [2025-09-10 04:33:38] 
iteration 7270/ 11920 | consumed samples: 7444480 | elapsed time per iteration (ms): 5872.2 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840683E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:35:05.594802 | finish at 2025-09-10 12:08:44 + [2025-09-10 04:33:44] iteration 7271/ 11920 | consumed samples: 7445504 | elapsed time per iteration (ms): 5972.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850706E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:42:47.393208 | finish at 2025-09-10 12:16:32 + [2025-09-10 04:33:50] iteration 7272/ 11920 | consumed samples: 7446528 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852142E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:35.418766 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:33:56] iteration 7273/ 11920 | consumed samples: 7447552 | elapsed time per iteration (ms): 5934.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863574E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:36.716223 | finish at 2025-09-10 12:13:33 + [2025-09-10 04:34:02] iteration 7274/ 11920 | consumed samples: 7448576 | elapsed time per iteration (ms): 5919.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852296E+00 | loss scale: 1.0 | grad norm: 0.213 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:21.024969 | finish at 2025-09-10 12:12:23 + [2025-09-10 04:34:08] iteration 7275/ 11920 | consumed samples: 7449600 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848937E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:07.994803 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:34:13] iteration 7276/ 11920 | consumed samples: 7450624 | elapsed time per iteration (ms): 5939.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870280E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:41.400862 | finish at 2025-09-10 12:13:55 + [2025-09-10 04:34:19] iteration 7277/ 11920 | consumed samples: 7451648 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860035E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:53.409292 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:34:25] iteration 7278/ 11920 | consumed samples: 7452672 | elapsed time per iteration (ms): 5946.1 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854568E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:01.920139 | finish at 2025-09-10 12:14:27 + [2025-09-10 04:34:31] iteration 7279/ 11920 | consumed samples: 7453696 | elapsed time per iteration (ms): 5627.5 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847988E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:17.433575 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:34:36] iteration 7280/ 11920 | consumed samples: 7454720 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866934E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:55.194740 | finish at 2025-09-10 11:50:31 + [2025-09-10 04:34:42] iteration 7281/ 11920 | consumed samples: 7455744 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852300E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:41.929599 | finish at 2025-09-10 11:50:24 + [2025-09-10 04:34:48] iteration 7282/ 11920 | consumed samples: 7456768 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835546E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:51.743361 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:34:53] iteration 7283/ 11920 | consumed samples: 7457792 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859032E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:26.603706 
| finish at 2025-09-10 11:49:20 + [2025-09-10 04:34:59] iteration 7284/ 11920 | consumed samples: 7458816 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838060E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:12.107747 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:35:05] iteration 7285/ 11920 | consumed samples: 7459840 | elapsed time per iteration (ms): 5829.1 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865484E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:17.826068 | finish at 2025-09-10 12:05:22 + [2025-09-10 04:35:10] iteration 7286/ 11920 | consumed samples: 7460864 | elapsed time per iteration (ms): 5821.7 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850394E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:29:37.779237 | finish at 2025-09-10 12:04:48 + [2025-09-10 04:35:16] iteration 7287/ 11920 | consumed samples: 7461888 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857446E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:33.614615 | finish at 2025-09-10 11:48:50 + [2025-09-10 04:35:22] iteration 7288/ 11920 | consumed samples: 7462912 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.852699E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:03.479370 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:35:27] iteration 7289/ 11920 | consumed samples: 7463936 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852969E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:44.640584 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:35:33] iteration 7290/ 11920 | consumed samples: 7464960 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850425E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:07.068257 | finish at 2025-09-10 11:49:40 + [2025-09-10 04:35:39] iteration 7291/ 11920 | consumed samples: 7465984 | elapsed time per iteration (ms): 5920.9 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862717E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:36:47.821209 | finish at 2025-09-10 12:12:27 + [2025-09-10 04:35:44] iteration 7292/ 11920 | consumed samples: 7467008 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856136E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:43.848981 | finish at 2025-09-10 11:50:28 + [2025-09-10 04:35:51] iteration 7293/ 11920 | consumed samples: 7468032 | 
elapsed time per iteration (ms): 6157.6 | throughput per GPU (TFLOP/s/GPU): 73.3 | MFU 7.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857497E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:54:51.325174 | finish at 2025-09-10 12:30:42 + [2025-09-10 04:35:57] iteration 7294/ 11920 | consumed samples: 7469056 | elapsed time per iteration (ms): 5969.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856913E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:16.365366 | finish at 2025-09-10 12:16:13 + [2025-09-10 04:36:02] iteration 7295/ 11920 | consumed samples: 7470080 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857024E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:03.481705 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:36:08] iteration 7296/ 11920 | consumed samples: 7471104 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856576E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:43.845638 | finish at 2025-09-10 11:49:52 + [2025-09-10 04:36:13] iteration 7297/ 11920 | consumed samples: 7472128 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846456E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:13:15.643300 | finish at 2025-09-10 11:49:29 + [2025-09-10 04:36:19] iteration 7298/ 11920 | consumed samples: 7473152 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858764E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:47.255678 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:36:25] iteration 7299/ 11920 | consumed samples: 7474176 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864788E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:18.255767 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:36:30] iteration 7300/ 11920 | consumed samples: 7475200 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847459E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:21.005416 | finish at 2025-09-10 11:49:51 + [2025-09-10 04:36:36] iteration 7301/ 11920 | consumed samples: 7476224 | elapsed time per iteration (ms): 5846.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850041E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:03.430360 | finish at 2025-09-10 12:06:40 + [2025-09-10 04:36:42] iteration 7302/ 11920 | consumed samples: 7477248 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.842735E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:05.208975 | finish at 2025-09-10 11:49:47 + [2025-09-10 04:36:47] iteration 7303/ 11920 | consumed samples: 7478272 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839902E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:27.066137 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:36:53] iteration 7304/ 11920 | consumed samples: 7479296 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855115E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:18.324007 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:36:59] iteration 7305/ 11920 | consumed samples: 7480320 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844421E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:57.526124 | finish at 2025-09-10 11:48:56 + [2025-09-10 04:37:04] iteration 7306/ 11920 | consumed samples: 7481344 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847546E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:55.494198 | finish at 2025-09-10 11:49:00 + [2025-09-10 04:37:10] 
iteration 7307/ 11920 | consumed samples: 7482368 | elapsed time per iteration (ms): 5874.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865741E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:31:40.399455 | finish at 2025-09-10 12:08:51 + [2025-09-10 04:37:16] iteration 7308/ 11920 | consumed samples: 7483392 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862341E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:27.438243 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:37:22] iteration 7309/ 11920 | consumed samples: 7484416 | elapsed time per iteration (ms): 5969.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848299E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:46.406981 | finish at 2025-09-10 12:16:08 + [2025-09-10 04:37:27] iteration 7310/ 11920 | consumed samples: 7485440 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849883E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:38.191376 | finish at 2025-09-10 11:50:06 + [2025-09-10 04:37:33] iteration 7311/ 11920 | consumed samples: 7486464 | elapsed time per iteration (ms): 5991.4 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854843E+00 | loss scale: 1.0 | grad norm: 0.244 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:40:14.376901 | finish at 2025-09-10 12:17:48 + [2025-09-10 04:37:39] iteration 7312/ 11920 | consumed samples: 7487488 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857489E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:34.831055 | finish at 2025-09-10 12:17:14 + [2025-09-10 04:37:45] iteration 7313/ 11920 | consumed samples: 7488512 | elapsed time per iteration (ms): 6005.3 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850801E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:06.431587 | finish at 2025-09-10 12:18:52 + [2025-09-10 04:37:51] iteration 7314/ 11920 | consumed samples: 7489536 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858151E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:52.976802 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:37:57] iteration 7315/ 11920 | consumed samples: 7490560 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849588E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:59.967055 | finish at 2025-09-10 11:49:57 + [2025-09-10 04:38:02] iteration 7316/ 11920 | consumed samples: 7491584 | elapsed time per iteration (ms): 5630.3 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861796E+00 | loss scale: 1.0 | grad norm: 0.297 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:01.948607 | finish at 2025-09-10 11:50:04 + [2025-09-10 04:38:08] iteration 7317/ 11920 | consumed samples: 7492608 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877206E+00 | loss scale: 1.0 | grad norm: 0.332 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:12.935745 | finish at 2025-09-10 11:50:21 + [2025-09-10 04:38:14] iteration 7318/ 11920 | consumed samples: 7493632 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873412E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:35.440166 | finish at 2025-09-10 11:49:49 + [2025-09-10 04:38:19] iteration 7319/ 11920 | consumed samples: 7494656 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861184E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:21.592522 | finish at 2025-09-10 11:49:41 + [2025-09-10 04:38:25] iteration 7320/ 11920 | consumed samples: 7495680 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874647E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:20.937672 
| finish at 2025-09-10 11:49:46 + [2025-09-10 04:38:30] iteration 7321/ 11920 | consumed samples: 7496704 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854999E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:28.352998 | finish at 2025-09-10 11:49:59 + [2025-09-10 04:38:36] iteration 7322/ 11920 | consumed samples: 7497728 | elapsed time per iteration (ms): 5651.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860909E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:05.754798 | finish at 2025-09-10 11:51:42 + [2025-09-10 04:38:42] iteration 7323/ 11920 | consumed samples: 7498752 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864031E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:07.610968 | finish at 2025-09-10 11:49:49 + [2025-09-10 04:38:47] iteration 7324/ 11920 | consumed samples: 7499776 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857010E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:17.272113 | finish at 2025-09-10 11:50:05 + [2025-09-10 04:38:53] iteration 7325/ 11920 | consumed samples: 7500800 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.871080E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:37.119275 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:38:59] iteration 7326/ 11920 | consumed samples: 7501824 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860263E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:23.793189 | finish at 2025-09-10 11:49:22 + [2025-09-10 04:39:04] iteration 7327/ 11920 | consumed samples: 7502848 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858982E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:20.496794 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:39:10] iteration 7328/ 11920 | consumed samples: 7503872 | elapsed time per iteration (ms): 5971.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858198E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:01.005013 | finish at 2025-09-10 12:16:11 + [2025-09-10 04:39:16] iteration 7329/ 11920 | consumed samples: 7504896 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882557E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:21.001505 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:39:21] iteration 7330/ 11920 | consumed samples: 7505920 | 
elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871198E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:55.006077 | finish at 2025-09-10 11:49:16 + [2025-09-10 04:39:27] iteration 7331/ 11920 | consumed samples: 7506944 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839267E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:53.023048 | finish at 2025-09-10 11:49:20 + [2025-09-10 04:39:33] iteration 7332/ 11920 | consumed samples: 7507968 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858643E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:20.786077 | finish at 2025-09-10 11:49:53 + [2025-09-10 04:39:38] iteration 7333/ 11920 | consumed samples: 7508992 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850595E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:50.469179 | finish at 2025-09-10 11:50:29 + [2025-09-10 04:39:44] iteration 7334/ 11920 | consumed samples: 7510016 | elapsed time per iteration (ms): 6052.1 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856457E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:42:34.744463 | finish at 2025-09-10 12:22:19 + [2025-09-10 04:39:50] iteration 7335/ 11920 | consumed samples: 7511040 | elapsed time per iteration (ms): 5937.6 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855018E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:43.814636 | finish at 2025-09-10 12:13:34 + [2025-09-10 04:39:56] iteration 7336/ 11920 | consumed samples: 7512064 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851176E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:40.860054 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:40:02] iteration 7337/ 11920 | consumed samples: 7513088 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849772E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:21.349184 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:40:07] iteration 7338/ 11920 | consumed samples: 7514112 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843546E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:10.948719 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:40:13] iteration 7339/ 11920 | consumed samples: 7515136 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.844054E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:16.759614 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:40:19] iteration 7340/ 11920 | consumed samples: 7516160 | elapsed time per iteration (ms): 5995.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854038E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:37:38.096433 | finish at 2025-09-10 12:17:57 + [2025-09-10 04:40:25] iteration 7341/ 11920 | consumed samples: 7517184 | elapsed time per iteration (ms): 5945.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850031E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:33:45.597975 | finish at 2025-09-10 12:14:10 + [2025-09-10 04:40:30] iteration 7342/ 11920 | consumed samples: 7518208 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852753E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:02.551994 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:40:36] iteration 7343/ 11920 | consumed samples: 7519232 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840807E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:38.479018 | finish at 2025-09-10 11:50:14 + [2025-09-10 04:40:42] 
iteration 7344/ 11920 | consumed samples: 7520256 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853956E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:04.834236 | finish at 2025-09-10 11:49:46 + [2025-09-10 04:40:47] iteration 7345/ 11920 | consumed samples: 7521280 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846632E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:52.413805 | finish at 2025-09-10 11:49:40 + [2025-09-10 04:40:53] iteration 7346/ 11920 | consumed samples: 7522304 | elapsed time per iteration (ms): 5847.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838468E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:25:47.754395 | finish at 2025-09-10 12:06:41 + [2025-09-10 04:40:59] iteration 7347/ 11920 | consumed samples: 7523328 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850423E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:27.041069 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:41:04] iteration 7348/ 11920 | consumed samples: 7524352 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843653E+00 | loss scale: 1.0 | grad norm: 0.180 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:18.670481 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:41:10] iteration 7349/ 11920 | consumed samples: 7525376 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846867E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:28.716727 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:41:16] iteration 7350/ 11920 | consumed samples: 7526400 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850295E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:12.263153 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:41:21] iteration 7351/ 11920 | consumed samples: 7527424 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860538E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:36.002046 | finish at 2025-09-10 11:49:57 + [2025-09-10 04:41:27] iteration 7352/ 11920 | consumed samples: 7528448 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859650E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:14.558916 | finish at 2025-09-10 11:49:41 + [2025-09-10 04:41:32] iteration 7353/ 11920 | consumed samples: 7529472 | elapsed time per iteration (ms): 5632.6 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851559E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:44.073627 | finish at 2025-09-10 11:50:16 + [2025-09-10 04:41:38] iteration 7354/ 11920 | consumed samples: 7530496 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865762E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:21.556545 | finish at 2025-09-10 11:50:00 + [2025-09-10 04:41:44] iteration 7355/ 11920 | consumed samples: 7531520 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865407E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:23.046744 | finish at 2025-09-10 11:50:07 + [2025-09-10 04:41:49] iteration 7356/ 11920 | consumed samples: 7532544 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849272E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:03.422773 | finish at 2025-09-10 11:49:53 + [2025-09-10 04:41:55] iteration 7357/ 11920 | consumed samples: 7533568 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858432E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:08.351232 
| finish at 2025-09-10 11:49:03 + [2025-09-10 04:42:01] iteration 7358/ 11920 | consumed samples: 7534592 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841887E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:08.775569 | finish at 2025-09-10 11:49:09 + [2025-09-10 04:42:06] iteration 7359/ 11920 | consumed samples: 7535616 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853990E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:12.113738 | finish at 2025-09-10 11:49:18 + [2025-09-10 04:42:12] iteration 7360/ 11920 | consumed samples: 7536640 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854594E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:38.361568 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:42:17] iteration 7361/ 11920 | consumed samples: 7537664 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857100E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:25.260864 | finish at 2025-09-10 11:49:43 + [2025-09-10 04:42:23] iteration 7362/ 11920 | consumed samples: 7538688 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.860437E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:56.912525 | finish at 2025-09-10 11:49:20 + [2025-09-10 04:42:29] iteration 7363/ 11920 | consumed samples: 7539712 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841647E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:55.426347 | finish at 2025-09-10 11:49:24 + [2025-09-10 04:42:34] iteration 7364/ 11920 | consumed samples: 7540736 | elapsed time per iteration (ms): 5835.0 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852542E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:23:04.149378 | finish at 2025-09-10 12:05:39 + [2025-09-10 04:42:40] iteration 7365/ 11920 | consumed samples: 7541760 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823767E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:39.361204 | finish at 2025-09-10 11:49:19 + [2025-09-10 04:42:46] iteration 7366/ 11920 | consumed samples: 7542784 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853941E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:42.186174 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:42:51] iteration 7367/ 11920 | consumed samples: 7543808 
| elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846469E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:03.208410 | finish at 2025-09-10 11:48:55 + [2025-09-10 04:42:57] iteration 7368/ 11920 | consumed samples: 7544832 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841737E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:47.423433 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:43:03] iteration 7369/ 11920 | consumed samples: 7545856 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855708E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:12.812063 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:43:08] iteration 7370/ 11920 | consumed samples: 7546880 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850533E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:36.266747 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:43:14] iteration 7371/ 11920 | consumed samples: 7547904 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853018E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:06:24.551355 | finish at 2025-09-10 11:49:38 + [2025-09-10 04:43:19] iteration 7372/ 11920 | consumed samples: 7548928 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847545E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:15.783674 | finish at 2025-09-10 11:49:35 + [2025-09-10 04:43:25] iteration 7373/ 11920 | consumed samples: 7549952 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845893E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:00.184361 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:43:31] iteration 7374/ 11920 | consumed samples: 7550976 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851843E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:35.901104 | finish at 2025-09-10 11:50:07 + [2025-09-10 04:43:36] iteration 7375/ 11920 | consumed samples: 7552000 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856527E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:40.654939 | finish at 2025-09-10 11:50:17 + [2025-09-10 04:43:42] iteration 7376/ 11920 | consumed samples: 7553024 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.857261E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:07.566284 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:43:48] iteration 7377/ 11920 | consumed samples: 7554048 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846279E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:00.137281 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:43:53] iteration 7378/ 11920 | consumed samples: 7555072 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849592E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:34.618192 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:43:59] iteration 7379/ 11920 | consumed samples: 7556096 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849207E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:33.429791 | finish at 2025-09-10 11:49:32 + [2025-09-10 04:44:05] iteration 7380/ 11920 | consumed samples: 7557120 | elapsed time per iteration (ms): 5920.4 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850219E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:58.433719 | finish at 2025-09-10 12:12:03 + [2025-09-10 04:44:10] 
iteration 7381/ 11920 | consumed samples: 7558144 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841651E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:12.846995 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:44:16] iteration 7382/ 11920 | consumed samples: 7559168 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851800E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:55.295596 | finish at 2025-09-10 11:49:11 + [2025-09-10 04:44:22] iteration 7383/ 11920 | consumed samples: 7560192 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855335E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:53.452568 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:44:27] iteration 7384/ 11920 | consumed samples: 7561216 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853059E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:36.155651 | finish at 2025-09-10 11:50:03 + [2025-09-10 04:44:33] iteration 7385/ 11920 | consumed samples: 7562240 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840698E+00 | loss scale: 1.0 | grad norm: 0.290 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:42.076749 | finish at 2025-09-10 11:50:15 + [2025-09-10 04:44:39] iteration 7386/ 11920 | consumed samples: 7563264 | elapsed time per iteration (ms): 5876.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862306E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:24:02.449561 | finish at 2025-09-10 12:08:41 + [2025-09-10 04:44:44] iteration 7387/ 11920 | consumed samples: 7564288 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853358E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:54.925976 | finish at 2025-09-10 11:49:39 + [2025-09-10 04:44:50] iteration 7388/ 11920 | consumed samples: 7565312 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847426E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:40.154058 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:44:56] iteration 7389/ 11920 | consumed samples: 7566336 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850237E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:36.986166 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:45:01] iteration 7390/ 11920 | consumed samples: 7567360 | elapsed time per iteration (ms): 5626.6 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853087E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:48.711970 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:45:07] iteration 7391/ 11920 | consumed samples: 7568384 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856064E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:06.986604 | finish at 2025-09-10 11:49:14 + [2025-09-10 04:45:13] iteration 7392/ 11920 | consumed samples: 7569408 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849676E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:33.226803 | finish at 2025-09-10 11:49:46 + [2025-09-10 04:45:18] iteration 7393/ 11920 | consumed samples: 7570432 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852624E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:05.911057 | finish at 2025-09-10 11:49:24 + [2025-09-10 04:45:24] iteration 7394/ 11920 | consumed samples: 7571456 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846860E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:49.124869 
| finish at 2025-09-10 11:49:13 + [2025-09-10 04:45:30] iteration 7395/ 11920 | consumed samples: 7572480 | elapsed time per iteration (ms): 5967.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872788E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:30:03.327912 | finish at 2025-09-10 12:15:33 + [2025-09-10 04:45:35] iteration 7396/ 11920 | consumed samples: 7573504 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854285E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:41.179461 | finish at 2025-09-10 11:50:17 + [2025-09-10 04:45:41] iteration 7397/ 11920 | consumed samples: 7574528 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849898E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:46.274261 | finish at 2025-09-10 11:49:27 + [2025-09-10 04:45:47] iteration 7398/ 11920 | consumed samples: 7575552 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855565E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:19.213041 | finish at 2025-09-10 11:49:06 + [2025-09-10 04:45:52] iteration 7399/ 11920 | consumed samples: 7576576 | elapsed time per iteration (ms): 5804.5 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.846455E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:17:22.251677 | finish at 2025-09-10 12:03:15 + [2025-09-10 04:45:58] iteration 7400/ 11920 | consumed samples: 7577600 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845284E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:17.389908 | finish at 2025-09-10 11:50:15 + [2025-09-10 04:46:04] iteration 7401/ 11920 | consumed samples: 7578624 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838777E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:41.381144 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:46:09] iteration 7402/ 11920 | consumed samples: 7579648 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865005E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:16.725247 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:46:15] iteration 7403/ 11920 | consumed samples: 7580672 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852018E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:10.782011 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:46:21] iteration 7404/ 11920 | consumed samples: 7581696 
| elapsed time per iteration (ms): 5951.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845763E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:57.250436 | finish at 2025-09-10 12:14:18 + [2025-09-10 04:46:27] iteration 7405/ 11920 | consumed samples: 7582720 | elapsed time per iteration (ms): 5835.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852541E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:08.983315 | finish at 2025-09-10 12:05:36 + [2025-09-10 04:46:32] iteration 7406/ 11920 | consumed samples: 7583744 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837534E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:13.802797 | finish at 2025-09-10 11:49:46 + [2025-09-10 04:46:38] iteration 7407/ 11920 | consumed samples: 7584768 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847786E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:03.723781 | finish at 2025-09-10 11:50:42 + [2025-09-10 04:46:44] iteration 7408/ 11920 | consumed samples: 7585792 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839769E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 7:02:43.174278 | finish at 2025-09-10 11:49:27 + [2025-09-10 04:46:49] iteration 7409/ 11920 | consumed samples: 7586816 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827228E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:27.602424 | finish at 2025-09-10 11:49:17 + [2025-09-10 04:46:55] iteration 7410/ 11920 | consumed samples: 7587840 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847018E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:14.462879 | finish at 2025-09-10 11:50:09 + [2025-09-10 04:47:00] iteration 7411/ 11920 | consumed samples: 7588864 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855287E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:53.278652 | finish at 2025-09-10 11:49:54 + [2025-09-10 04:47:06] iteration 7412/ 11920 | consumed samples: 7589888 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859858E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:31.194201 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:47:12] iteration 7413/ 11920 | consumed samples: 7590912 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.848258E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:17.828449 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:47:17] iteration 7414/ 11920 | consumed samples: 7591936 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842668E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:59.311574 | finish at 2025-09-10 11:49:17 + [2025-09-10 04:47:23] iteration 7415/ 11920 | consumed samples: 7592960 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846232E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:14.505991 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:47:29] iteration 7416/ 11920 | consumed samples: 7593984 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862027E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:56.394695 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:47:34] iteration 7417/ 11920 | consumed samples: 7595008 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847375E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:38.089475 | finish at 2025-09-10 11:50:12 + [2025-09-10 04:47:40] 
iteration 7418/ 11920 | consumed samples: 7596032 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847353E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:18.283300 | finish at 2025-09-10 11:49:58 + [2025-09-10 04:47:45] iteration 7419/ 11920 | consumed samples: 7597056 | elapsed time per iteration (ms): 5613.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853967E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:05.961612 | finish at 2025-09-10 11:48:51 + [2025-09-10 04:47:51] iteration 7420/ 11920 | consumed samples: 7598080 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856467E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:58.239927 | finish at 2025-09-10 11:49:49 + [2025-09-10 04:47:57] iteration 7421/ 11920 | consumed samples: 7599104 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858044E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:31.952914 | finish at 2025-09-10 11:50:29 + [2025-09-10 04:48:02] iteration 7422/ 11920 | consumed samples: 7600128 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842539E+00 | loss scale: 1.0 | grad norm: 0.247 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:31.553298 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:48:08] iteration 7423/ 11920 | consumed samples: 7601152 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850174E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:23.506281 | finish at 2025-09-10 11:49:31 + [2025-09-10 04:48:14] iteration 7424/ 11920 | consumed samples: 7602176 | elapsed time per iteration (ms): 5976.2 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855933E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:49.210052 | finish at 2025-09-10 12:16:03 + [2025-09-10 04:48:20] iteration 7425/ 11920 | consumed samples: 7603200 | elapsed time per iteration (ms): 5889.5 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845103E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:13.210969 | finish at 2025-09-10 12:09:33 + [2025-09-10 04:48:25] iteration 7426/ 11920 | consumed samples: 7604224 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858460E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:24.564776 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:48:32] iteration 7427/ 11920 | consumed samples: 7605248 | elapsed time per iteration (ms): 6247.9 | throughput 
per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856241E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:47:51.929361 | finish at 2025-09-10 12:36:24 + [2025-09-10 04:48:38] iteration 7428/ 11920 | consumed samples: 7606272 | elapsed time per iteration (ms): 5975.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856951E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:21.892931 | finish at 2025-09-10 12:16:00 + [2025-09-10 04:48:44] iteration 7429/ 11920 | consumed samples: 7607296 | elapsed time per iteration (ms): 6213.7 | throughput per GPU (TFLOP/s/GPU): 72.7 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843797E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:45:05.562058 | finish at 2025-09-10 12:33:49 + [2025-09-10 04:48:51] iteration 7430/ 11920 | consumed samples: 7608320 | elapsed time per iteration (ms): 6639.6 | throughput per GPU (TFLOP/s/GPU): 68.0 | MFU 6.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842420E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 8:16:51.757071 | finish at 2025-09-10 13:05:42 + [2025-09-10 04:48:56] iteration 7431/ 11920 | consumed samples: 7609344 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853043E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:21.000077 
| finish at 2025-09-10 11:50:17 + [2025-09-10 04:49:02] iteration 7432/ 11920 | consumed samples: 7610368 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838538E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:42.897406 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:49:07] iteration 7433/ 11920 | consumed samples: 7611392 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846729E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:25.656088 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:49:13] iteration 7434/ 11920 | consumed samples: 7612416 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842660E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:31.460103 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:49:19] iteration 7435/ 11920 | consumed samples: 7613440 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845386E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:14.153430 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:49:25] iteration 7436/ 11920 | consumed samples: 7614464 | elapsed time per iteration (ms): 5918.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.875916E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:20.087874 | finish at 2025-09-10 12:11:45 + [2025-09-10 04:49:30] iteration 7437/ 11920 | consumed samples: 7615488 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867749E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:30.782622 | finish at 2025-09-10 11:50:01 + [2025-09-10 04:49:36] iteration 7438/ 11920 | consumed samples: 7616512 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857388E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:06.782593 | finish at 2025-09-10 11:50:43 + [2025-09-10 04:49:41] iteration 7439/ 11920 | consumed samples: 7617536 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843859E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:51.984070 | finish at 2025-09-10 11:50:33 + [2025-09-10 04:49:47] iteration 7440/ 11920 | consumed samples: 7618560 | elapsed time per iteration (ms): 5982.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846583E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:26:40.508728 | finish at 2025-09-10 12:16:28 + [2025-09-10 04:49:53] iteration 7441/ 11920 | consumed samples: 7619584 | 
elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871775E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:41.004114 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:49:59] iteration 7442/ 11920 | consumed samples: 7620608 | elapsed time per iteration (ms): 6283.8 | throughput per GPU (TFLOP/s/GPU): 71.8 | MFU 7.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863362E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:48:58.730979 | finish at 2025-09-10 12:38:58 + [2025-09-10 04:50:05] iteration 7443/ 11920 | consumed samples: 7621632 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861172E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:30.785856 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:50:11] iteration 7444/ 11920 | consumed samples: 7622656 | elapsed time per iteration (ms): 5836.4 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846959E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:23.664648 | finish at 2025-09-10 12:05:34 + [2025-09-10 04:50:16] iteration 7445/ 11920 | consumed samples: 7623680 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851408E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:59:25.292084 | finish at 2025-09-10 11:49:42 + [2025-09-10 04:50:22] iteration 7446/ 11920 | consumed samples: 7624704 | elapsed time per iteration (ms): 5832.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854950E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:14:52.915286 | finish at 2025-09-10 12:05:15 + [2025-09-10 04:50:28] iteration 7447/ 11920 | consumed samples: 7625728 | elapsed time per iteration (ms): 6151.4 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850739E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:38:35.001443 | finish at 2025-09-10 12:29:03 + [2025-09-10 04:50:34] iteration 7448/ 11920 | consumed samples: 7626752 | elapsed time per iteration (ms): 5820.1 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848348E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:47.631447 | finish at 2025-09-10 12:04:22 + [2025-09-10 04:50:40] iteration 7449/ 11920 | consumed samples: 7627776 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850942E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:55.358791 | finish at 2025-09-10 11:50:35 + [2025-09-10 04:50:45] iteration 7450/ 11920 | consumed samples: 7628800 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.851458E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:15.897202 | finish at 2025-09-10 11:50:01 + [2025-09-10 04:50:51] iteration 7451/ 11920 | consumed samples: 7629824 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851796E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:37.761305 | finish at 2025-09-10 11:49:29 + [2025-09-10 04:50:57] iteration 7452/ 11920 | consumed samples: 7630848 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861003E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:20.886894 | finish at 2025-09-10 11:50:18 + [2025-09-10 04:51:02] iteration 7453/ 11920 | consumed samples: 7631872 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858096E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:00.501877 | finish at 2025-09-10 11:50:03 + [2025-09-10 04:51:08] iteration 7454/ 11920 | consumed samples: 7632896 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852345E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:18.522327 | finish at 2025-09-10 11:49:27 + [2025-09-10 04:51:14] 
iteration 7455/ 11920 | consumed samples: 7633920 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858184E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:14.865426 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:51:19] iteration 7456/ 11920 | consumed samples: 7634944 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857342E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:55.361275 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:51:25] iteration 7457/ 11920 | consumed samples: 7635968 | elapsed time per iteration (ms): 5977.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864311E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:24:35.313749 | finish at 2025-09-10 12:16:01 + [2025-09-10 04:51:31] iteration 7458/ 11920 | consumed samples: 7636992 | elapsed time per iteration (ms): 5921.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844912E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:21.697749 | finish at 2025-09-10 12:11:53 + [2025-09-10 04:51:37] iteration 7459/ 11920 | consumed samples: 7638016 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857766E+00 | loss scale: 1.0 | grad norm: 0.175 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:42.041757 | finish at 2025-09-10 11:50:19 + [2025-09-10 04:51:43] iteration 7460/ 11920 | consumed samples: 7639040 | elapsed time per iteration (ms): 5831.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855481E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:29.279242 | finish at 2025-09-10 12:05:12 + [2025-09-10 04:51:48] iteration 7461/ 11920 | consumed samples: 7640064 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850733E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:19.079284 | finish at 2025-09-10 11:50:07 + [2025-09-10 04:51:54] iteration 7462/ 11920 | consumed samples: 7641088 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851206E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:20.821353 | finish at 2025-09-10 11:49:15 + [2025-09-10 04:51:59] iteration 7463/ 11920 | consumed samples: 7642112 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831138E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:17.876188 | finish at 2025-09-10 11:50:17 + [2025-09-10 04:52:05] iteration 7464/ 11920 | consumed samples: 7643136 | elapsed time per iteration (ms): 5626.6 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861565E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:52.274206 | finish at 2025-09-10 11:49:57 + [2025-09-10 04:52:11] iteration 7465/ 11920 | consumed samples: 7644160 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845423E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:59.456037 | finish at 2025-09-10 11:49:10 + [2025-09-10 04:52:16] iteration 7466/ 11920 | consumed samples: 7645184 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842115E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:19.756066 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:52:22] iteration 7467/ 11920 | consumed samples: 7646208 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843834E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:03.124608 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:52:28] iteration 7468/ 11920 | consumed samples: 7647232 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858258E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:57.616665 
| finish at 2025-09-10 11:50:25 + [2025-09-10 04:52:33] iteration 7469/ 11920 | consumed samples: 7648256 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860172E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:12.016821 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:52:39] iteration 7470/ 11920 | consumed samples: 7649280 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858999E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:48.745921 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:52:44] iteration 7471/ 11920 | consumed samples: 7650304 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829324E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:53.871114 | finish at 2025-09-10 11:49:38 + [2025-09-10 04:52:50] iteration 7472/ 11920 | consumed samples: 7651328 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847449E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:28.948082 | finish at 2025-09-10 11:50:19 + [2025-09-10 04:52:56] iteration 7473/ 11920 | consumed samples: 7652352 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.860934E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:17.278465 | finish at 2025-09-10 11:50:13 + [2025-09-10 04:53:01] iteration 7474/ 11920 | consumed samples: 7653376 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851739E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:57.663616 | finish at 2025-09-10 11:49:59 + [2025-09-10 04:53:07] iteration 7475/ 11920 | consumed samples: 7654400 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843718E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:43.498039 | finish at 2025-09-10 11:49:50 + [2025-09-10 04:53:13] iteration 7476/ 11920 | consumed samples: 7655424 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838390E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:15.011430 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:53:18] iteration 7477/ 11920 | consumed samples: 7656448 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839455E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:11.847992 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:53:24] iteration 7478/ 11920 | consumed samples: 7657472 | 
elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847024E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:12.987450 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:53:29] iteration 7479/ 11920 | consumed samples: 7658496 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840966E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:25.289088 | finish at 2025-09-10 11:49:55 + [2025-09-10 04:53:35] iteration 7480/ 11920 | consumed samples: 7659520 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840769E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:12.675362 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:53:41] iteration 7481/ 11920 | consumed samples: 7660544 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859407E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:31.740427 | finish at 2025-09-10 11:49:12 + [2025-09-10 04:53:46] iteration 7482/ 11920 | consumed samples: 7661568 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839404E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 9.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:56:15.861032 | finish at 2025-09-10 11:50:02 + [2025-09-10 04:53:52] iteration 7483/ 11920 | consumed samples: 7662592 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861357E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:00.662814 | finish at 2025-09-10 11:49:53 + [2025-09-10 04:53:58] iteration 7484/ 11920 | consumed samples: 7663616 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857776E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:01.209540 | finish at 2025-09-10 11:49:59 + [2025-09-10 04:54:03] iteration 7485/ 11920 | consumed samples: 7664640 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844388E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:30.929613 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:54:09] iteration 7486/ 11920 | consumed samples: 7665664 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852453E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:10.062020 | finish at 2025-09-10 11:49:19 + [2025-09-10 04:54:14] iteration 7487/ 11920 | consumed samples: 7666688 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.861922E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:11.361527 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:54:20] iteration 7488/ 11920 | consumed samples: 7667712 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847371E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:27.253712 | finish at 2025-09-10 11:49:47 + [2025-09-10 04:54:26] iteration 7489/ 11920 | consumed samples: 7668736 | elapsed time per iteration (ms): 5934.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844941E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:18:13.448424 | finish at 2025-09-10 12:12:39 + [2025-09-10 04:54:32] iteration 7490/ 11920 | consumed samples: 7669760 | elapsed time per iteration (ms): 5935.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860500E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:18:12.661283 | finish at 2025-09-10 12:12:45 + [2025-09-10 04:54:38] iteration 7491/ 11920 | consumed samples: 7670784 | elapsed time per iteration (ms): 5990.0 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851728E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:22:09.852596 | finish at 2025-09-10 12:16:48 + [2025-09-10 04:54:44] 
iteration 7492/ 11920 | consumed samples: 7671808 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851407E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:03.107160 | finish at 2025-09-10 11:49:47 + [2025-09-10 04:54:49] iteration 7493/ 11920 | consumed samples: 7672832 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848852E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:55:24.668070 | finish at 2025-09-10 11:50:14 + [2025-09-10 04:54:55] iteration 7494/ 11920 | consumed samples: 7673856 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847782E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:34.817008 | finish at 2025-09-10 11:49:30 + [2025-09-10 04:55:00] iteration 7495/ 11920 | consumed samples: 7674880 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824265E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:54.258428 | finish at 2025-09-10 11:49:55 + [2025-09-10 04:55:06] iteration 7496/ 11920 | consumed samples: 7675904 | elapsed time per iteration (ms): 5655.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847956E+00 | loss scale: 1.0 | grad norm: 0.198 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:00.485464 | finish at 2025-09-10 11:52:07 + [2025-09-10 04:55:12] iteration 7497/ 11920 | consumed samples: 7676928 | elapsed time per iteration (ms): 5867.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843741E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:33.096955 | finish at 2025-09-10 12:07:45 + [2025-09-10 04:55:18] iteration 7498/ 11920 | consumed samples: 7677952 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849972E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:28.065285 | finish at 2025-09-10 11:49:46 + [2025-09-10 04:55:23] iteration 7499/ 11920 | consumed samples: 7678976 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855349E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:01.848624 | finish at 2025-09-10 11:49:25 + [2025-09-10 04:55:29] iteration 7500/ 11920 | consumed samples: 7680000 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853683E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:58.426762 | finish at 2025-09-10 11:49:27 + [2025-09-10 04:55:34] iteration 7501/ 11920 | consumed samples: 7681024 | elapsed time per iteration (ms): 5619.5 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856694E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:52.760851 | finish at 2025-09-10 11:49:27 + [2025-09-10 04:55:40] iteration 7502/ 11920 | consumed samples: 7682048 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853536E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:57.134281 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:55:46] iteration 7503/ 11920 | consumed samples: 7683072 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850765E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:58.200681 | finish at 2025-09-10 11:49:44 + [2025-09-10 04:55:51] iteration 7504/ 11920 | consumed samples: 7684096 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835288E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:34.691620 | finish at 2025-09-10 11:50:26 + [2025-09-10 04:55:57] iteration 7505/ 11920 | consumed samples: 7685120 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868987E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:14.032642 
| finish at 2025-09-10 11:50:11 + [2025-09-10 04:56:03] iteration 7506/ 11920 | consumed samples: 7686144 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852435E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:56.717566 | finish at 2025-09-10 11:49:59 + [2025-09-10 04:56:09] iteration 7507/ 11920 | consumed samples: 7687168 | elapsed time per iteration (ms): 6269.3 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858682E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:41:06.420829 | finish at 2025-09-10 12:37:15 + [2025-09-10 04:56:14] iteration 7508/ 11920 | consumed samples: 7688192 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844842E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:21.912905 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:56:20] iteration 7509/ 11920 | consumed samples: 7689216 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842906E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:02.529358 | finish at 2025-09-10 11:49:23 + [2025-09-10 04:56:26] iteration 7510/ 11920 | consumed samples: 7690240 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.848173E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:06.921637 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:56:31] iteration 7511/ 11920 | consumed samples: 7691264 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848674E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:31.754973 | finish at 2025-09-10 11:50:03 + [2025-09-10 04:56:37] iteration 7512/ 11920 | consumed samples: 7692288 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842434E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:59.478533 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:56:43] iteration 7513/ 11920 | consumed samples: 7693312 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849959E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:50.460106 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:56:48] iteration 7514/ 11920 | consumed samples: 7694336 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851681E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:29.800662 | finish at 2025-09-10 11:50:18 + [2025-09-10 04:56:54] iteration 7515/ 11920 | consumed samples: 7695360 | 
elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851732E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:17.485011 | finish at 2025-09-10 11:50:11 + [2025-09-10 04:56:59] iteration 7516/ 11920 | consumed samples: 7696384 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840029E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:20.427778 | finish at 2025-09-10 11:50:20 + [2025-09-10 04:57:05] iteration 7517/ 11920 | consumed samples: 7697408 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841808E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:27.777821 | finish at 2025-09-10 11:49:33 + [2025-09-10 04:57:11] iteration 7518/ 11920 | consumed samples: 7698432 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836074E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:23.627533 | finish at 2025-09-10 11:49:34 + [2025-09-10 04:57:16] iteration 7519/ 11920 | consumed samples: 7699456 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864688E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:52:35.622901 | finish at 2025-09-10 11:49:52 + [2025-09-10 04:57:22] iteration 7520/ 11920 | consumed samples: 7700480 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853188E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:29.155521 | finish at 2025-09-10 11:49:51 + [2025-09-10 04:57:28] iteration 7521/ 11920 | consumed samples: 7701504 | elapsed time per iteration (ms): 5839.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852969E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:06.278001 | finish at 2025-09-10 12:05:34 + [2025-09-10 04:57:33] iteration 7522/ 11920 | consumed samples: 7702528 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849928E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:06.701989 | finish at 2025-09-10 11:49:40 + [2025-09-10 04:57:39] iteration 7523/ 11920 | consumed samples: 7703552 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852785E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:02.766486 | finish at 2025-09-10 11:49:42 + [2025-09-10 04:57:45] iteration 7524/ 11920 | consumed samples: 7704576 | elapsed time per iteration (ms): 5851.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.840539E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:42.302228 | finish at 2025-09-10 12:06:27 + [2025-09-10 04:57:51] iteration 7525/ 11920 | consumed samples: 7705600 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828539E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:34.644402 | finish at 2025-09-10 11:50:25 + [2025-09-10 04:57:56] iteration 7526/ 11920 | consumed samples: 7706624 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866388E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:45.135892 | finish at 2025-09-10 11:49:41 + [2025-09-10 04:58:02] iteration 7527/ 11920 | consumed samples: 7707648 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852825E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:44.953275 | finish at 2025-09-10 11:50:47 + [2025-09-10 04:58:07] iteration 7528/ 11920 | consumed samples: 7708672 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853051E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:18.321110 | finish at 2025-09-10 11:49:26 + [2025-09-10 04:58:13] 
iteration 7529/ 11920 | consumed samples: 7709696 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846276E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:15.893123 | finish at 2025-09-10 11:49:29 + [2025-09-10 04:58:19] iteration 7530/ 11920 | consumed samples: 7710720 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850204E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:26.309311 | finish at 2025-09-10 11:49:45 + [2025-09-10 04:58:24] iteration 7531/ 11920 | consumed samples: 7711744 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862290E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:13.074353 | finish at 2025-09-10 11:49:37 + [2025-09-10 04:58:30] iteration 7532/ 11920 | consumed samples: 7712768 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844116E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:20.742416 | finish at 2025-09-10 11:49:51 + [2025-09-10 04:58:36] iteration 7533/ 11920 | consumed samples: 7713792 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856240E+00 | loss scale: 1.0 | grad norm: 0.271 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:15.705637 | finish at 2025-09-10 11:49:51 + [2025-09-10 04:58:41] iteration 7534/ 11920 | consumed samples: 7714816 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855226E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:06.695960 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:58:47] iteration 7535/ 11920 | consumed samples: 7715840 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868248E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:00.738494 | finish at 2025-09-10 11:49:48 + [2025-09-10 04:58:52] iteration 7536/ 11920 | consumed samples: 7716864 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836082E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:29.136749 | finish at 2025-09-10 11:50:22 + [2025-09-10 04:58:58] iteration 7537/ 11920 | consumed samples: 7717888 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850415E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:07.512597 | finish at 2025-09-10 11:50:06 + [2025-09-10 04:59:04] iteration 7538/ 11920 | consumed samples: 7718912 | elapsed time per iteration (ms): 5621.3 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844322E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:50:32.573088 | finish at 2025-09-10 11:49:36 + [2025-09-10 04:59:09] iteration 7539/ 11920 | consumed samples: 7719936 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850610E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:50:18.576884 | finish at 2025-09-10 11:49:28 + [2025-09-10 04:59:15] iteration 7540/ 11920 | consumed samples: 7720960 | elapsed time per iteration (ms): 5942.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847075E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:46.944366 | finish at 2025-09-10 12:13:02 + [2025-09-10 04:59:21] iteration 7541/ 11920 | consumed samples: 7721984 | elapsed time per iteration (ms): 5969.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847880E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:38.476954 | finish at 2025-09-10 12:15:00 + [2025-09-10 04:59:27] iteration 7542/ 11920 | consumed samples: 7723008 | elapsed time per iteration (ms): 6135.8 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842452E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:27:42.400377 
| finish at 2025-09-10 12:27:10 + [2025-09-10 04:59:33] iteration 7543/ 11920 | consumed samples: 7724032 | elapsed time per iteration (ms): 5884.9 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849922E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:18.243188 | finish at 2025-09-10 12:08:51 + [2025-09-10 04:59:39] iteration 7544/ 11920 | consumed samples: 7725056 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841934E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:40.689388 | finish at 2025-09-10 11:49:20 + [2025-09-10 04:59:45] iteration 7545/ 11920 | consumed samples: 7726080 | elapsed time per iteration (ms): 5879.0 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844541E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:08:40.750988 | finish at 2025-09-10 12:08:25 + [2025-09-10 04:59:51] iteration 7546/ 11920 | consumed samples: 7727104 | elapsed time per iteration (ms): 6039.0 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855536E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:20:14.430767 | finish at 2025-09-10 12:20:05 + [2025-09-10 04:59:56] iteration 7547/ 11920 | consumed samples: 7728128 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.843165E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:55.527872 | finish at 2025-09-10 11:49:52 + [2025-09-10 05:00:02] iteration 7548/ 11920 | consumed samples: 7729152 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840912E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:50:23.786616 | finish at 2025-09-10 11:50:26 + [2025-09-10 05:00:08] iteration 7549/ 11920 | consumed samples: 7730176 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842893E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:50:24.492680 | finish at 2025-09-10 11:50:32 + [2025-09-10 05:00:13] iteration 7550/ 11920 | consumed samples: 7731200 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836962E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:44.591339 | finish at 2025-09-10 11:49:58 + [2025-09-10 05:00:19] iteration 7551/ 11920 | consumed samples: 7732224 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848394E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:14.439907 | finish at 2025-09-10 11:49:33 + [2025-09-10 05:00:25] iteration 7552/ 11920 | consumed samples: 7733248 | 
elapsed time per iteration (ms): 5818.2 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828152E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:03:34.057423 | finish at 2025-09-10 12:03:59 + [2025-09-10 05:00:31] iteration 7553/ 11920 | consumed samples: 7734272 | elapsed time per iteration (ms): 5919.6 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848150E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:10:50.751644 | finish at 2025-09-10 12:11:21 + [2025-09-10 05:00:37] iteration 7554/ 11920 | consumed samples: 7735296 | elapsed time per iteration (ms): 6309.9 | throughput per GPU (TFLOP/s/GPU): 71.6 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843013E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:39:08.851844 | finish at 2025-09-10 12:39:46 + [2025-09-10 05:00:43] iteration 7555/ 11920 | consumed samples: 7736320 | elapsed time per iteration (ms): 5932.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836273E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:33.052694 | finish at 2025-09-10 12:12:16 + [2025-09-10 05:00:48] iteration 7556/ 11920 | consumed samples: 7737344 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852884E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:49:04.338046 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:00:54] iteration 7557/ 11920 | consumed samples: 7738368 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858245E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:53.568841 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:01:00] iteration 7558/ 11920 | consumed samples: 7739392 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861476E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:31.153871 | finish at 2025-09-10 11:50:31 + [2025-09-10 05:01:05] iteration 7559/ 11920 | consumed samples: 7740416 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857486E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:59.614622 | finish at 2025-09-10 11:50:05 + [2025-09-10 05:01:11] iteration 7560/ 11920 | consumed samples: 7741440 | elapsed time per iteration (ms): 6049.5 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853826E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:35.864649 | finish at 2025-09-10 12:20:47 + [2025-09-10 05:01:17] iteration 7561/ 11920 | consumed samples: 7742464 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.851987E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:18.277888 | finish at 2025-09-10 11:49:35 + [2025-09-10 05:01:23] iteration 7562/ 11920 | consumed samples: 7743488 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845373E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:18.299651 | finish at 2025-09-10 11:49:41 + [2025-09-10 05:01:28] iteration 7563/ 11920 | consumed samples: 7744512 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841058E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:51.991194 | finish at 2025-09-10 11:50:20 + [2025-09-10 05:01:34] iteration 7564/ 11920 | consumed samples: 7745536 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833485E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:49.316434 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:01:40] iteration 7565/ 11920 | consumed samples: 7746560 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841236E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:57.036994 | finish at 2025-09-10 11:49:37 + [2025-09-10 05:01:45] 
iteration 7566/ 11920 | consumed samples: 7747584 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846606E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:35.425717 | finish at 2025-09-10 11:50:21 + [2025-09-10 05:01:51] iteration 7567/ 11920 | consumed samples: 7748608 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860756E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:49.670763 | finish at 2025-09-10 11:50:40 + [2025-09-10 05:01:56] iteration 7568/ 11920 | consumed samples: 7749632 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849739E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:04.893311 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:02:02] iteration 7569/ 11920 | consumed samples: 7750656 | elapsed time per iteration (ms): 5972.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832346E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:05.778363 | finish at 2025-09-10 12:15:08 + [2025-09-10 05:02:08] iteration 7570/ 11920 | consumed samples: 7751680 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841100E+00 | loss scale: 1.0 | grad norm: 0.167 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:37.660067 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:02:14] iteration 7571/ 11920 | consumed samples: 7752704 | elapsed time per iteration (ms): 5928.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845804E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:43.753380 | finish at 2025-09-10 12:11:58 + [2025-09-10 05:02:20] iteration 7572/ 11920 | consumed samples: 7753728 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847274E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:26.355041 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:02:25] iteration 7573/ 11920 | consumed samples: 7754752 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848330E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:26.720955 | finish at 2025-09-10 11:49:52 + [2025-09-10 05:02:31] iteration 7574/ 11920 | consumed samples: 7755776 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846879E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:12.137403 | finish at 2025-09-10 11:49:43 + [2025-09-10 05:02:36] iteration 7575/ 11920 | consumed samples: 7756800 | elapsed time per iteration (ms): 5616.6 | throughput 
per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844047E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:44.279441 | finish at 2025-09-10 11:49:21 + [2025-09-10 05:02:42] iteration 7576/ 11920 | consumed samples: 7757824 | elapsed time per iteration (ms): 5827.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844116E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:54.390141 | finish at 2025-09-10 12:04:37 + [2025-09-10 05:02:48] iteration 7577/ 11920 | consumed samples: 7758848 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835470E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:08.430668 | finish at 2025-09-10 11:49:56 + [2025-09-10 05:02:54] iteration 7578/ 11920 | consumed samples: 7759872 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848594E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:14.810221 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:02:59] iteration 7579/ 11920 | consumed samples: 7760896 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846082E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:00.183565 
| finish at 2025-09-10 11:49:59 + [2025-09-10 05:03:05] iteration 7580/ 11920 | consumed samples: 7761920 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852174E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:37.163134 | finish at 2025-09-10 11:49:42 + [2025-09-10 05:03:10] iteration 7581/ 11920 | consumed samples: 7762944 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843766E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:11.911029 | finish at 2025-09-10 11:49:22 + [2025-09-10 05:03:16] iteration 7582/ 11920 | consumed samples: 7763968 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838593E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:07.983034 | finish at 2025-09-10 11:49:24 + [2025-09-10 05:03:22] iteration 7583/ 11920 | consumed samples: 7764992 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856713E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:49.748698 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:03:27] iteration 7584/ 11920 | consumed samples: 7766016 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.859421E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:50.093639 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:03:33] iteration 7585/ 11920 | consumed samples: 7767040 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843153E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:10.592684 | finish at 2025-09-10 11:49:43 + [2025-09-10 05:03:39] iteration 7586/ 11920 | consumed samples: 7768064 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848279E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:24.456950 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:03:44] iteration 7587/ 11920 | consumed samples: 7769088 | elapsed time per iteration (ms): 5976.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844028E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:11:37.481479 | finish at 2025-09-10 12:15:22 + [2025-09-10 05:03:50] iteration 7588/ 11920 | consumed samples: 7770112 | elapsed time per iteration (ms): 6001.4 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866855E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:13:18.090594 | finish at 2025-09-10 12:17:09 + [2025-09-10 05:03:56] iteration 7589/ 11920 | consumed samples: 7771136 | 
elapsed time per iteration (ms): 5986.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864830E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:06.460811 | finish at 2025-09-10 12:16:03 + [2025-09-10 05:04:02] iteration 7590/ 11920 | consumed samples: 7772160 | elapsed time per iteration (ms): 5954.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859156E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:09:43.732250 | finish at 2025-09-10 12:13:46 + [2025-09-10 05:04:08] iteration 7591/ 11920 | consumed samples: 7773184 | elapsed time per iteration (ms): 5855.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845786E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:29.099336 | finish at 2025-09-10 12:06:37 + [2025-09-10 05:04:14] iteration 7592/ 11920 | consumed samples: 7774208 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843513E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:27.955479 | finish at 2025-09-10 11:49:42 + [2025-09-10 05:04:20] iteration 7593/ 11920 | consumed samples: 7775232 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838315E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:45:20.547621 | finish at 2025-09-10 11:49:40 + [2025-09-10 05:04:25] iteration 7594/ 11920 | consumed samples: 7776256 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855256E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:15.089933 | finish at 2025-09-10 11:49:40 + [2025-09-10 05:04:31] iteration 7595/ 11920 | consumed samples: 7777280 | elapsed time per iteration (ms): 6039.9 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853138E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:15:22.705954 | finish at 2025-09-10 12:19:54 + [2025-09-10 05:04:37] iteration 7596/ 11920 | consumed samples: 7778304 | elapsed time per iteration (ms): 5843.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851918E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:08.319330 | finish at 2025-09-10 12:05:45 + [2025-09-10 05:04:43] iteration 7597/ 11920 | consumed samples: 7779328 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836859E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:06.416654 | finish at 2025-09-10 11:49:49 + [2025-09-10 05:04:48] iteration 7598/ 11920 | consumed samples: 7780352 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.842869E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:00.268273 | finish at 2025-09-10 11:50:49 + [2025-09-10 05:04:54] iteration 7599/ 11920 | consumed samples: 7781376 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853481E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:09.336835 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:05:00] iteration 7600/ 11920 | consumed samples: 7782400 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864739E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:26.148834 | finish at 2025-09-10 11:50:26 + [2025-09-10 05:05:05] iteration 7601/ 11920 | consumed samples: 7783424 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842327E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:40.103974 | finish at 2025-09-10 11:49:45 + [2025-09-10 05:05:11] iteration 7602/ 11920 | consumed samples: 7784448 | elapsed time per iteration (ms): 5924.2 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845411E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:06:20.821448 | finish at 2025-09-10 12:11:32 + [2025-09-10 05:05:17] 
iteration 7603/ 11920 | consumed samples: 7785472 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833324E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:40.004305 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:05:23] iteration 7604/ 11920 | consumed samples: 7786496 | elapsed time per iteration (ms): 5835.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838839E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:59:45.928484 | finish at 2025-09-10 12:05:08 + [2025-09-10 05:05:28] iteration 7605/ 11920 | consumed samples: 7787520 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852085E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:32.627035 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:05:34] iteration 7606/ 11920 | consumed samples: 7788544 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850401E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:41.744921 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:05:39] iteration 7607/ 11920 | consumed samples: 7789568 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848953E+00 | loss scale: 1.0 | grad norm: 0.205 | num 
zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:07.749664 | finish at 2025-09-10 11:49:47 + [2025-09-10 05:05:45] iteration 7608/ 11920 | consumed samples: 7790592 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855816E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:28.854149 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:05:51] iteration 7609/ 11920 | consumed samples: 7791616 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846772E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:52.142694 | finish at 2025-09-10 11:50:43 + [2025-09-10 05:05:56] iteration 7610/ 11920 | consumed samples: 7792640 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839840E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:53.985896 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:06:02] iteration 7611/ 11920 | consumed samples: 7793664 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850404E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:47.799217 | finish at 2025-09-10 11:50:50 + [2025-09-10 05:06:08] iteration 7612/ 11920 | consumed samples: 7794688 | elapsed time per iteration (ms): 5627.8 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853325E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:04.393902 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:06:13] iteration 7613/ 11920 | consumed samples: 7795712 | elapsed time per iteration (ms): 5644.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842501E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:09.208315 | finish at 2025-09-10 11:51:22 + [2025-09-10 05:06:19] iteration 7614/ 11920 | consumed samples: 7796736 | elapsed time per iteration (ms): 5932.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840754E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:45.687402 | finish at 2025-09-10 12:12:05 + [2025-09-10 05:06:25] iteration 7615/ 11920 | consumed samples: 7797760 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846686E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:05.040572 | finish at 2025-09-10 11:49:30 + [2025-09-10 05:06:30] iteration 7616/ 11920 | consumed samples: 7798784 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852616E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:08.127537 
| finish at 2025-09-10 11:49:39 + [2025-09-10 05:06:36] iteration 7617/ 11920 | consumed samples: 7799808 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844093E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:21.032568 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:06:42] iteration 7618/ 11920 | consumed samples: 7800832 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853314E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:02.398662 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:06:47] iteration 7619/ 11920 | consumed samples: 7801856 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848135E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:22.281139 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:06:53] iteration 7620/ 11920 | consumed samples: 7802880 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840634E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:34.580655 | finish at 2025-09-10 11:50:27 + [2025-09-10 05:06:59] iteration 7621/ 11920 | consumed samples: 7803904 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.846510E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:26.088690 | finish at 2025-09-10 11:50:25 + [2025-09-10 05:07:04] iteration 7622/ 11920 | consumed samples: 7804928 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837353E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:10.361461 | finish at 2025-09-10 11:50:15 + [2025-09-10 05:07:10] iteration 7623/ 11920 | consumed samples: 7805952 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834427E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:56.876405 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:07:15] iteration 7624/ 11920 | consumed samples: 7806976 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847912E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:03.014442 | finish at 2025-09-10 11:50:18 + [2025-09-10 05:07:21] iteration 7625/ 11920 | consumed samples: 7808000 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845863E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:39.070870 | finish at 2025-09-10 11:50:00 + [2025-09-10 05:07:27] iteration 7626/ 11920 | consumed samples: 7809024 | 
elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859582E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:40.436238 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:07:32] iteration 7627/ 11920 | consumed samples: 7810048 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859123E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:01.187712 | finish at 2025-09-10 11:49:33 + [2025-09-10 05:07:38] iteration 7628/ 11920 | consumed samples: 7811072 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842710E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:06.537660 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:07:44] iteration 7629/ 11920 | consumed samples: 7812096 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857714E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:10.339730 | finish at 2025-09-10 11:49:54 + [2025-09-10 05:07:49] iteration 7630/ 11920 | consumed samples: 7813120 | elapsed time per iteration (ms): 5841.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835564E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:57:40.247934 | finish at 2025-09-10 12:05:30 + [2025-09-10 05:07:55] iteration 7631/ 11920 | consumed samples: 7814144 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845604E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:30.124929 | finish at 2025-09-10 11:50:25 + [2025-09-10 05:08:01] iteration 7632/ 11920 | consumed samples: 7815168 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849888E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:05.416351 | finish at 2025-09-10 11:50:06 + [2025-09-10 05:08:06] iteration 7633/ 11920 | consumed samples: 7816192 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833460E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:30.442518 | finish at 2025-09-10 11:49:37 + [2025-09-10 05:08:12] iteration 7634/ 11920 | consumed samples: 7817216 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847444E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:00.198942 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:08:17] iteration 7635/ 11920 | consumed samples: 7818240 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.852124E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:33.991685 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:08:23] iteration 7636/ 11920 | consumed samples: 7819264 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842077E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:30.949857 | finish at 2025-09-10 11:49:54 + [2025-09-10 05:08:29] iteration 7637/ 11920 | consumed samples: 7820288 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851491E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:41.902661 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:08:34] iteration 7638/ 11920 | consumed samples: 7821312 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847579E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:16.339022 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:08:40] iteration 7639/ 11920 | consumed samples: 7822336 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839472E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:04.099333 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:08:46] 
iteration 7640/ 11920 | consumed samples: 7823360 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852677E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:09.093742 | finish at 2025-09-10 11:49:55 + [2025-09-10 05:08:51] iteration 7641/ 11920 | consumed samples: 7824384 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849301E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:00.446887 | finish at 2025-09-10 11:50:52 + [2025-09-10 05:08:57] iteration 7642/ 11920 | consumed samples: 7825408 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861864E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:53.355642 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:09:03] iteration 7643/ 11920 | consumed samples: 7826432 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849476E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:33.823227 | finish at 2025-09-10 11:50:36 + [2025-09-10 05:09:08] iteration 7644/ 11920 | consumed samples: 7827456 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852219E+00 | loss scale: 1.0 | grad norm: 0.235 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:35.528752 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:09:14] iteration 7645/ 11920 | consumed samples: 7828480 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845757E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:08.939495 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:09:19] iteration 7646/ 11920 | consumed samples: 7829504 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843243E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:09.014720 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:09:25] iteration 7647/ 11920 | consumed samples: 7830528 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845568E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:19.271823 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:09:31] iteration 7648/ 11920 | consumed samples: 7831552 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833698E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:11.684898 | finish at 2025-09-10 11:49:42 + [2025-09-10 05:09:36] iteration 7649/ 11920 | consumed samples: 7832576 | elapsed time per iteration (ms): 5862.7 | throughput 
per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855162E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:19.523413 | finish at 2025-09-10 12:06:56 + [2025-09-10 05:09:42] iteration 7650/ 11920 | consumed samples: 7833600 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839543E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:03.882437 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:09:48] iteration 7651/ 11920 | consumed samples: 7834624 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846542E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:24.678149 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:09:53] iteration 7652/ 11920 | consumed samples: 7835648 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838744E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:32.804943 | finish at 2025-09-10 11:50:26 + [2025-09-10 05:09:59] iteration 7653/ 11920 | consumed samples: 7836672 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846769E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:05.765278 
| finish at 2025-09-10 11:50:05 + [2025-09-10 05:10:05] iteration 7654/ 11920 | consumed samples: 7837696 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835074E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:23.118563 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:10:10] iteration 7655/ 11920 | consumed samples: 7838720 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836486E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:39:56.331592 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:10:16] iteration 7656/ 11920 | consumed samples: 7839744 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830080E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:39:41.535370 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:10:22] iteration 7657/ 11920 | consumed samples: 7840768 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846870E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:39:26.406012 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:10:27] iteration 7658/ 11920 | consumed samples: 7841792 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.850276E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:39:01.693832 | finish at 2025-09-10 11:49:29 + [2025-09-10 05:10:33] iteration 7659/ 11920 | consumed samples: 7842816 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845510E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:50.881033 | finish at 2025-09-10 11:49:24 + [2025-09-10 05:10:38] iteration 7660/ 11920 | consumed samples: 7843840 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852625E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:39:30.814219 | finish at 2025-09-10 11:50:09 + [2025-09-10 05:10:44] iteration 7661/ 11920 | consumed samples: 7844864 | elapsed time per iteration (ms): 5975.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831455E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:04:07.454530 | finish at 2025-09-10 12:14:52 + [2025-09-10 05:10:50] iteration 7662/ 11920 | consumed samples: 7845888 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846435E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:39:36.121049 | finish at 2025-09-10 11:50:26 + [2025-09-10 05:10:56] iteration 7663/ 11920 | consumed samples: 7846912 | 
elapsed time per iteration (ms): 5816.8 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852595E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:42.221625 | finish at 2025-09-10 12:03:38 + [2025-09-10 05:11:01] iteration 7664/ 11920 | consumed samples: 7847936 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857148E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:51.940163 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:11:07] iteration 7665/ 11920 | consumed samples: 7848960 | elapsed time per iteration (ms): 5953.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862386E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:02:11.763226 | finish at 2025-09-10 12:13:19 + [2025-09-10 05:11:13] iteration 7666/ 11920 | consumed samples: 7849984 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846169E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:36.822624 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:11:19] iteration 7667/ 11920 | consumed samples: 7851008 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841414E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:38:27.278298 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:11:24] iteration 7668/ 11920 | consumed samples: 7852032 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846191E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:12.686299 | finish at 2025-09-10 11:49:37 + [2025-09-10 05:11:30] iteration 7669/ 11920 | consumed samples: 7853056 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854574E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:15.143855 | finish at 2025-09-10 11:49:45 + [2025-09-10 05:11:35] iteration 7670/ 11920 | consumed samples: 7854080 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852712E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:31.599100 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:11:41] iteration 7671/ 11920 | consumed samples: 7855104 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843926E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:08.276036 | finish at 2025-09-10 11:49:49 + [2025-09-10 05:11:47] iteration 7672/ 11920 | consumed samples: 7856128 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.854079E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:03.396326 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:11:52] iteration 7673/ 11920 | consumed samples: 7857152 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850631E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:00.671002 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:11:58] iteration 7674/ 11920 | consumed samples: 7858176 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841978E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:46.847205 | finish at 2025-09-10 11:49:45 + [2025-09-10 05:12:04] iteration 7675/ 11920 | consumed samples: 7859200 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839873E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:25.522192 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:12:09] iteration 7676/ 11920 | consumed samples: 7860224 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839485E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:37.211982 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:12:15] 
iteration 7677/ 11920 | consumed samples: 7861248 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829827E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:56.338612 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:12:20] iteration 7678/ 11920 | consumed samples: 7862272 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851910E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:18.005646 | finish at 2025-09-10 11:49:38 + [2025-09-10 05:12:26] iteration 7679/ 11920 | consumed samples: 7863296 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839799E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:38.432917 | finish at 2025-09-10 11:50:05 + [2025-09-10 05:12:32] iteration 7680/ 11920 | consumed samples: 7864320 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840500E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:46.882954 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:12:37] iteration 7681/ 11920 | consumed samples: 7865344 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836728E+00 | loss scale: 1.0 | grad norm: 0.197 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:54.282343 | finish at 2025-09-10 11:50:32 + [2025-09-10 05:12:43] iteration 7682/ 11920 | consumed samples: 7866368 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834373E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:16.118872 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:12:49] iteration 7683/ 11920 | consumed samples: 7867392 | elapsed time per iteration (ms): 5974.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831808E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:55.700830 | finish at 2025-09-10 12:14:45 + [2025-09-10 05:12:55] iteration 7684/ 11920 | consumed samples: 7868416 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841708E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:04.836785 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:13:00] iteration 7685/ 11920 | consumed samples: 7869440 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844656E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:17.843448 | finish at 2025-09-10 11:50:18 + [2025-09-10 05:13:06] iteration 7686/ 11920 | consumed samples: 7870464 | elapsed time per iteration (ms): 5627.4 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854964E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:06.499091 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:13:11] iteration 7687/ 11920 | consumed samples: 7871488 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846193E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:35.326145 | finish at 2025-09-10 11:49:47 + [2025-09-10 05:13:17] iteration 7688/ 11920 | consumed samples: 7872512 | elapsed time per iteration (ms): 5852.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852705E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:46.830381 | finish at 2025-09-10 12:06:04 + [2025-09-10 05:13:23] iteration 7689/ 11920 | consumed samples: 7873536 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846167E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:11.449801 | finish at 2025-09-10 11:49:34 + [2025-09-10 05:13:29] iteration 7690/ 11920 | consumed samples: 7874560 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839415E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:05.488508 
| finish at 2025-09-10 11:49:34 + [2025-09-10 05:13:34] iteration 7691/ 11920 | consumed samples: 7875584 | elapsed time per iteration (ms): 5939.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842834E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 12.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:38.160011 | finish at 2025-09-10 12:12:13 + [2025-09-10 05:13:40] iteration 7692/ 11920 | consumed samples: 7876608 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849637E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:08.017579 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:13:46] iteration 7693/ 11920 | consumed samples: 7877632 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830041E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:00.867179 | finish at 2025-09-10 11:49:47 + [2025-09-10 05:13:51] iteration 7694/ 11920 | consumed samples: 7878656 | elapsed time per iteration (ms): 5652.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837927E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:38:05.772949 | finish at 2025-09-10 11:51:57 + [2025-09-10 05:13:57] iteration 7695/ 11920 | consumed samples: 7879680 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.828682E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:16.193786 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:14:03] iteration 7696/ 11920 | consumed samples: 7880704 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849445E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:36:07.938812 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:14:08] iteration 7697/ 11920 | consumed samples: 7881728 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851947E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:35:55.062673 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:14:14] iteration 7698/ 11920 | consumed samples: 7882752 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833841E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:35:41.135046 | finish at 2025-09-10 11:49:55 + [2025-09-10 05:14:20] iteration 7699/ 11920 | consumed samples: 7883776 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851233E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:35:29.311636 | finish at 2025-09-10 11:49:49 + [2025-09-10 05:14:25] iteration 7700/ 11920 | consumed samples: 7884800 | 
elapsed time per iteration (ms): 5979.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849467E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:00:35.091987 | finish at 2025-09-10 12:15:01 + [2025-09-10 05:14:31] iteration 7701/ 11920 | consumed samples: 7885824 | elapsed time per iteration (ms): 5922.2 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844348E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:25.840110 | finish at 2025-09-10 12:10:57 + [2025-09-10 05:14:37] iteration 7702/ 11920 | consumed samples: 7886848 | elapsed time per iteration (ms): 5856.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847075E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:43.576452 | finish at 2025-09-10 12:06:21 + [2025-09-10 05:14:44] iteration 7703/ 11920 | consumed samples: 7887872 | elapsed time per iteration (ms): 6284.5 | throughput per GPU (TFLOP/s/GPU): 71.8 | MFU 7.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846959E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:21:41.537943 | finish at 2025-09-10 12:36:25 + [2025-09-10 05:14:49] iteration 7704/ 11920 | consumed samples: 7888896 | elapsed time per iteration (ms): 5865.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839658E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:52:10.059608 | finish at 2025-09-10 12:06:59 + [2025-09-10 05:14:56] iteration 7705/ 11920 | consumed samples: 7889920 | elapsed time per iteration (ms): 6250.1 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837686E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:19:04.332862 | finish at 2025-09-10 12:34:00 + [2025-09-10 05:15:01] iteration 7706/ 11920 | consumed samples: 7890944 | elapsed time per iteration (ms): 5824.8 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833417E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:05.645975 | finish at 2025-09-10 12:04:07 + [2025-09-10 05:15:07] iteration 7707/ 11920 | consumed samples: 7891968 | elapsed time per iteration (ms): 5888.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853261E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:26.589023 | finish at 2025-09-10 12:08:34 + [2025-09-10 05:15:13] iteration 7708/ 11920 | consumed samples: 7892992 | elapsed time per iteration (ms): 6006.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847610E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:01:39.436269 | finish at 2025-09-10 12:16:53 + [2025-09-10 05:15:19] iteration 7709/ 11920 | consumed samples: 7894016 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.830728E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:28.321438 | finish at 2025-09-10 11:49:47 + [2025-09-10 05:15:25] iteration 7710/ 11920 | consumed samples: 7895040 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836579E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:29.067581 | finish at 2025-09-10 11:49:54 + [2025-09-10 05:15:30] iteration 7711/ 11920 | consumed samples: 7896064 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848670E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:16.232289 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:15:36] iteration 7712/ 11920 | consumed samples: 7897088 | elapsed time per iteration (ms): 5896.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836221E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:31.813320 | finish at 2025-09-10 12:09:08 + [2025-09-10 05:15:42] iteration 7713/ 11920 | consumed samples: 7898112 | elapsed time per iteration (ms): 5829.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848047E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:42.431466 | finish at 2025-09-10 12:04:24 + [2025-09-10 05:15:48] 
iteration 7714/ 11920 | consumed samples: 7899136 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830140E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:26.797378 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:15:53] iteration 7715/ 11920 | consumed samples: 7900160 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818914E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:35:16.428020 | finish at 2025-09-10 11:51:10 + [2025-09-10 05:15:59] iteration 7716/ 11920 | consumed samples: 7901184 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830428E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:44.785612 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:16:04] iteration 7717/ 11920 | consumed samples: 7902208 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841847E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:16.760799 | finish at 2025-09-10 11:50:21 + [2025-09-10 05:16:10] iteration 7718/ 11920 | consumed samples: 7903232 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827072E+00 | loss scale: 1.0 | grad norm: 0.160 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:54.698158 | finish at 2025-09-10 11:50:05 + [2025-09-10 05:16:16] iteration 7719/ 11920 | consumed samples: 7904256 | elapsed time per iteration (ms): 5925.0 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838098E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:50.870713 | finish at 2025-09-10 12:11:07 + [2025-09-10 05:16:22] iteration 7720/ 11920 | consumed samples: 7905280 | elapsed time per iteration (ms): 5964.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835906E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:57:32.626133 | finish at 2025-09-10 12:13:55 + [2025-09-10 05:16:28] iteration 7721/ 11920 | consumed samples: 7906304 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841454E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:28.197500 | finish at 2025-09-10 11:49:56 + [2025-09-10 05:16:33] iteration 7722/ 11920 | consumed samples: 7907328 | elapsed time per iteration (ms): 5834.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825856E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:13.800814 | finish at 2025-09-10 12:04:47 + [2025-09-10 05:16:39] iteration 7723/ 11920 | consumed samples: 7908352 | elapsed time per iteration (ms): 5617.0 | throughput 
per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839751E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:54.718541 | finish at 2025-09-10 11:49:34 + [2025-09-10 05:16:45] iteration 7724/ 11920 | consumed samples: 7909376 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838466E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:19.961974 | finish at 2025-09-10 11:50:05 + [2025-09-10 05:16:50] iteration 7725/ 11920 | consumed samples: 7910400 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847577E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:08.558575 | finish at 2025-09-10 11:50:59 + [2025-09-10 05:16:56] iteration 7726/ 11920 | consumed samples: 7911424 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837269E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:32.212481 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:17:02] iteration 7727/ 11920 | consumed samples: 7912448 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838162E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:09.108917 
| finish at 2025-09-10 11:50:11 + [2025-09-10 05:17:07] iteration 7728/ 11920 | consumed samples: 7913472 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844860E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:08.334419 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:17:13] iteration 7729/ 11920 | consumed samples: 7914496 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846175E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:30.885518 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:17:18] iteration 7730/ 11920 | consumed samples: 7915520 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828277E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:51.086600 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:17:24] iteration 7731/ 11920 | consumed samples: 7916544 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829717E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:37.281399 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:17:30] iteration 7732/ 11920 | consumed samples: 7917568 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.833611E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:36.835999 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:17:36] iteration 7733/ 11920 | consumed samples: 7918592 | elapsed time per iteration (ms): 5936.9 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850812E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:17.637167 | finish at 2025-09-10 12:11:53 + [2025-09-10 05:17:41] iteration 7734/ 11920 | consumed samples: 7919616 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840062E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:53.875198 | finish at 2025-09-10 11:50:35 + [2025-09-10 05:17:47] iteration 7735/ 11920 | consumed samples: 7920640 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848463E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:21.201718 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:17:53] iteration 7736/ 11920 | consumed samples: 7921664 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839865E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:00.431116 | finish at 2025-09-10 11:50:53 + [2025-09-10 05:17:58] iteration 7737/ 11920 | consumed samples: 7922688 | 
elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835918E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:55.478553 | finish at 2025-09-10 11:49:54 + [2025-09-10 05:18:04] iteration 7738/ 11920 | consumed samples: 7923712 | elapsed time per iteration (ms): 5844.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841539E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:47:21.985429 | finish at 2025-09-10 12:05:26 + [2025-09-10 05:18:10] iteration 7739/ 11920 | consumed samples: 7924736 | elapsed time per iteration (ms): 5945.4 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843551E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:17.779202 | finish at 2025-09-10 12:12:28 + [2025-09-10 05:18:16] iteration 7740/ 11920 | consumed samples: 7925760 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840440E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:54.162312 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:18:21] iteration 7741/ 11920 | consumed samples: 7926784 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851109E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:31:42.073583 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:18:27] iteration 7742/ 11920 | consumed samples: 7927808 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830662E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:10.143389 | finish at 2025-09-10 11:49:37 + [2025-09-10 05:18:32] iteration 7743/ 11920 | consumed samples: 7928832 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853689E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:08.545183 | finish at 2025-09-10 11:49:41 + [2025-09-10 05:18:38] iteration 7744/ 11920 | consumed samples: 7929856 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845793E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:14.378471 | finish at 2025-09-10 11:49:52 + [2025-09-10 05:18:44] iteration 7745/ 11920 | consumed samples: 7930880 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860249E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:16.695508 | finish at 2025-09-10 11:50:00 + [2025-09-10 05:18:49] iteration 7746/ 11920 | consumed samples: 7931904 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.846462E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:45.681992 | finish at 2025-09-10 11:50:35 + [2025-09-10 05:18:55] iteration 7747/ 11920 | consumed samples: 7932928 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848696E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:26.357446 | finish at 2025-09-10 11:50:21 + [2025-09-10 05:19:01] iteration 7748/ 11920 | consumed samples: 7933952 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841676E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:01.557768 | finish at 2025-09-10 11:50:02 + [2025-09-10 05:19:06] iteration 7749/ 11920 | consumed samples: 7934976 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854504E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:40.033801 | finish at 2025-09-10 11:50:46 + [2025-09-10 05:19:12] iteration 7750/ 11920 | consumed samples: 7936000 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863494E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:36.666141 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:19:17] 
iteration 7751/ 11920 | consumed samples: 7937024 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836170E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:33.729548 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:19:23] iteration 7752/ 11920 | consumed samples: 7938048 | elapsed time per iteration (ms): 5837.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836069E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:31.052177 | finish at 2025-09-10 12:04:54 + [2025-09-10 05:19:29] iteration 7753/ 11920 | consumed samples: 7939072 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842627E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:13.483651 | finish at 2025-09-10 11:49:42 + [2025-09-10 05:19:35] iteration 7754/ 11920 | consumed samples: 7940096 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844390E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:13.817423 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:19:40] iteration 7755/ 11920 | consumed samples: 7941120 | elapsed time per iteration (ms): 5938.0 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841610E+00 | loss scale: 1.0 | grad norm: 0.200 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:11.679485 | finish at 2025-09-10 12:11:52 + [2025-09-10 05:19:46] iteration 7756/ 11920 | consumed samples: 7942144 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848608E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:01.981327 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:19:52] iteration 7757/ 11920 | consumed samples: 7943168 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838786E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:01.729712 | finish at 2025-09-10 11:50:53 + [2025-09-10 05:19:58] iteration 7758/ 11920 | consumed samples: 7944192 | elapsed time per iteration (ms): 5941.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838648E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:52:09.973526 | finish at 2025-09-10 12:12:08 + [2025-09-10 05:20:03] iteration 7759/ 11920 | consumed samples: 7945216 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851961E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:10.435498 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:20:09] iteration 7760/ 11920 | consumed samples: 7946240 | elapsed time per iteration (ms): 5623.5 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834363E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:53.614655 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:20:15] iteration 7761/ 11920 | consumed samples: 7947264 | elapsed time per iteration (ms): 5940.0 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843900E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:51:44.263905 | finish at 2025-09-10 12:11:59 + [2025-09-10 05:20:20] iteration 7762/ 11920 | consumed samples: 7948288 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859325E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:53.726549 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:20:26] iteration 7763/ 11920 | consumed samples: 7949312 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852969E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:26.771750 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:20:32] iteration 7764/ 11920 | consumed samples: 7950336 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833205E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:22.091018 
| finish at 2025-09-10 11:49:54 + [2025-09-10 05:20:37] iteration 7765/ 11920 | consumed samples: 7951360 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849207E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:36.105977 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:20:43] iteration 7766/ 11920 | consumed samples: 7952384 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843472E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:01.619972 | finish at 2025-09-10 11:49:45 + [2025-09-10 05:20:49] iteration 7767/ 11920 | consumed samples: 7953408 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841633E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:27.303578 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:20:54] iteration 7768/ 11920 | consumed samples: 7954432 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849942E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:24.376465 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:21:00] iteration 7769/ 11920 | consumed samples: 7955456 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.855597E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:47.806012 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:21:05] iteration 7770/ 11920 | consumed samples: 7956480 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851360E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:06.131575 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:21:11] iteration 7771/ 11920 | consumed samples: 7957504 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849729E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:47.659278 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:21:17] iteration 7772/ 11920 | consumed samples: 7958528 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857281E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:25.407434 | finish at 2025-09-10 11:49:42 + [2025-09-10 05:21:23] iteration 7773/ 11920 | consumed samples: 7959552 | elapsed time per iteration (ms): 5847.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838676E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:44:10.335273 | finish at 2025-09-10 12:05:33 + [2025-09-10 05:21:29] iteration 7774/ 11920 | consumed samples: 7960576 | 
elapsed time per iteration (ms): 5991.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852467E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:53:59.181958 | finish at 2025-09-10 12:15:28 + [2025-09-10 05:21:34] iteration 7775/ 11920 | consumed samples: 7961600 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862467E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:35.440198 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:21:40] iteration 7776/ 11920 | consumed samples: 7962624 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836853E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:28.922085 | finish at 2025-09-10 11:50:09 + [2025-09-10 05:21:45] iteration 7777/ 11920 | consumed samples: 7963648 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826828E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:31.156029 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:21:51] iteration 7778/ 11920 | consumed samples: 7964672 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852368E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:28:31.949323 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:21:57] iteration 7779/ 11920 | consumed samples: 7965696 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851905E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:31.297084 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:22:02] iteration 7780/ 11920 | consumed samples: 7966720 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845489E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:52.351699 | finish at 2025-09-10 11:49:55 + [2025-09-10 05:22:08] iteration 7781/ 11920 | consumed samples: 7967744 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838650E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:27.193700 | finish at 2025-09-10 11:50:35 + [2025-09-10 05:22:14] iteration 7782/ 11920 | consumed samples: 7968768 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848283E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:56.945536 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:22:19] iteration 7783/ 11920 | consumed samples: 7969792 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.848514E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:53.735909 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:22:25] iteration 7784/ 11920 | consumed samples: 7970816 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841703E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:14.371758 | finish at 2025-09-10 11:49:39 + [2025-09-10 05:22:30] iteration 7785/ 11920 | consumed samples: 7971840 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857649E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:49.054182 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:22:36] iteration 7786/ 11920 | consumed samples: 7972864 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840502E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:07.074130 | finish at 2025-09-10 11:49:43 + [2025-09-10 05:22:42] iteration 7787/ 11920 | consumed samples: 7973888 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841508E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:26.882431 | finish at 2025-09-10 11:50:09 + [2025-09-10 05:22:47] 
iteration 7788/ 11920 | consumed samples: 7974912 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849235E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:49.116663 | finish at 2025-09-10 11:50:36 + [2025-09-10 05:22:53] iteration 7789/ 11920 | consumed samples: 7975936 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847890E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:50.564734 | finish at 2025-09-10 11:50:44 + [2025-09-10 05:22:59] iteration 7790/ 11920 | consumed samples: 7976960 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826726E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:05.136192 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:23:04] iteration 7791/ 11920 | consumed samples: 7977984 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834055E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:33.843652 | finish at 2025-09-10 11:49:38 + [2025-09-10 05:23:10] iteration 7792/ 11920 | consumed samples: 7979008 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850562E+00 | loss scale: 1.0 | grad norm: 0.189 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:45.708549 | finish at 2025-09-10 11:49:56 + [2025-09-10 05:23:15] iteration 7793/ 11920 | consumed samples: 7980032 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841028E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:37.864261 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:23:21] iteration 7794/ 11920 | consumed samples: 7981056 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837796E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:27:06.243405 | finish at 2025-09-10 11:50:27 + [2025-09-10 05:23:27] iteration 7795/ 11920 | consumed samples: 7982080 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831016E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:20.328012 | finish at 2025-09-10 11:49:47 + [2025-09-10 05:23:32] iteration 7796/ 11920 | consumed samples: 7983104 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841508E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:16.893293 | finish at 2025-09-10 11:49:49 + [2025-09-10 05:23:38] iteration 7797/ 11920 | consumed samples: 7984128 | elapsed time per iteration (ms): 5621.7 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842548E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:18.333195 | finish at 2025-09-10 11:49:56 + [2025-09-10 05:23:44] iteration 7798/ 11920 | consumed samples: 7985152 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850253E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:27.462728 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:23:49] iteration 7799/ 11920 | consumed samples: 7986176 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830597E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:46.812185 | finish at 2025-09-10 11:50:36 + [2025-09-10 05:23:55] iteration 7800/ 11920 | consumed samples: 7987200 | elapsed time per iteration (ms): 6193.7 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852747E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:05:18.072796 | finish at 2025-09-10 12:29:13 + [2025-09-10 05:24:01] iteration 7801/ 11920 | consumed samples: 7988224 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836347E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:50.310539 
| finish at 2025-09-10 11:49:51 + [2025-09-10 05:24:07] iteration 7802/ 11920 | consumed samples: 7989248 | elapsed time per iteration (ms): 5957.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841924E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:52.501872 | finish at 2025-09-10 12:12:59 + [2025-09-10 05:24:13] iteration 7803/ 11920 | consumed samples: 7990272 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830220E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:45.432328 | finish at 2025-09-10 11:49:58 + [2025-09-10 05:24:18] iteration 7804/ 11920 | consumed samples: 7991296 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854601E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:56.424342 | finish at 2025-09-10 11:50:15 + [2025-09-10 05:24:24] iteration 7805/ 11920 | consumed samples: 7992320 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827499E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:32.639039 | finish at 2025-09-10 11:50:56 + [2025-09-10 05:24:29] iteration 7806/ 11920 | consumed samples: 7993344 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.846086E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:46.144462 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:24:35] iteration 7807/ 11920 | consumed samples: 7994368 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850738E+00 | loss scale: 1.0 | grad norm: 0.328 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:40.727143 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:24:41] iteration 7808/ 11920 | consumed samples: 7995392 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859978E+00 | loss scale: 1.0 | grad norm: 0.350 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:26.628483 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:24:46] iteration 7809/ 11920 | consumed samples: 7996416 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845064E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:23.440928 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:24:52] iteration 7810/ 11920 | consumed samples: 7997440 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850952E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:58.200788 | finish at 2025-09-10 11:50:50 + [2025-09-10 05:24:58] iteration 7811/ 11920 | consumed samples: 7998464 | 
elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852681E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:14.165401 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:25:03] iteration 7812/ 11920 | consumed samples: 7999488 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841414E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:24:56.172967 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:25:09] iteration 7813/ 11920 | consumed samples: 8000512 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862387E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:10.233324 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:25:14] iteration 7814/ 11920 | consumed samples: 8001536 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855708E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:24:59.326829 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:25:20] iteration 7815/ 11920 | consumed samples: 8002560 | elapsed time per iteration (ms): 5925.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854428E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:45:25.904576 | finish at 2025-09-10 12:10:46 + [2025-09-10 05:25:27] iteration 7816/ 11920 | consumed samples: 8003584 | elapsed time per iteration (ms): 6323.2 | throughput per GPU (TFLOP/s/GPU): 71.4 | MFU 7.22% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866606E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:30.375566 | finish at 2025-09-10 12:37:57 + [2025-09-10 05:25:33] iteration 7817/ 11920 | consumed samples: 8004608 | elapsed time per iteration (ms): 6254.2 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835911E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:41.019698 | finish at 2025-09-10 12:33:14 + [2025-09-10 05:25:39] iteration 7818/ 11920 | consumed samples: 8005632 | elapsed time per iteration (ms): 5819.8 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852087E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:52.771268 | finish at 2025-09-10 12:03:32 + [2025-09-10 05:25:44] iteration 7819/ 11920 | consumed samples: 8006656 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859131E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:24:21.266052 | finish at 2025-09-10 11:50:06 + [2025-09-10 05:25:50] iteration 7820/ 11920 | consumed samples: 8007680 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.860763E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:39.683700 | finish at 2025-09-10 11:51:30 + [2025-09-10 05:25:56] iteration 7821/ 11920 | consumed samples: 8008704 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862741E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:24:33.593291 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:26:02] iteration 7822/ 11920 | consumed samples: 8009728 | elapsed time per iteration (ms): 5870.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837695E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:40:56.877508 | finish at 2025-09-10 12:06:58 + [2025-09-10 05:26:07] iteration 7823/ 11920 | consumed samples: 8010752 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850337E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:56.017186 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:26:13] iteration 7824/ 11920 | consumed samples: 8011776 | elapsed time per iteration (ms): 5879.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858192E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:22.978516 | finish at 2025-09-10 12:07:36 + [2025-09-10 05:26:19] 
iteration 7825/ 11920 | consumed samples: 8012800 | elapsed time per iteration (ms): 6066.9 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838122E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:54:04.004377 | finish at 2025-09-10 12:20:23 + [2025-09-10 05:26:25] iteration 7826/ 11920 | consumed samples: 8013824 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844845E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:38.996952 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:26:30] iteration 7827/ 11920 | consumed samples: 8014848 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843247E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:36.092069 | finish at 2025-09-10 11:50:06 + [2025-09-10 05:26:36] iteration 7828/ 11920 | consumed samples: 8015872 | elapsed time per iteration (ms): 5898.6 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861912E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:16.937957 | finish at 2025-09-10 12:08:53 + [2025-09-10 05:26:42] iteration 7829/ 11920 | consumed samples: 8016896 | elapsed time per iteration (ms): 5615.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835969E+00 | loss scale: 1.0 | grad norm: 0.220 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:52.264232 | finish at 2025-09-10 11:49:34 + [2025-09-10 05:26:48] iteration 7830/ 11920 | consumed samples: 8017920 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836620E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:09.917514 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:26:54] iteration 7831/ 11920 | consumed samples: 8018944 | elapsed time per iteration (ms): 6271.2 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836819E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:07:23.009678 | finish at 2025-09-10 12:34:17 + [2025-09-10 05:26:59] iteration 7832/ 11920 | consumed samples: 8019968 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854132E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:01.211554 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:27:05] iteration 7833/ 11920 | consumed samples: 8020992 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847833E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:46.243321 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:27:11] iteration 7834/ 11920 | consumed samples: 8022016 | elapsed time per iteration (ms): 5619.5 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842662E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:41.427678 | finish at 2025-09-10 11:49:52 + [2025-09-10 05:27:16] iteration 7835/ 11920 | consumed samples: 8023040 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835919E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:52.576464 | finish at 2025-09-10 11:50:09 + [2025-09-10 05:27:22] iteration 7836/ 11920 | consumed samples: 8024064 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846244E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:00.619697 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:27:28] iteration 7837/ 11920 | consumed samples: 8025088 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851591E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:52.579494 | finish at 2025-09-10 11:50:20 + [2025-09-10 05:27:33] iteration 7838/ 11920 | consumed samples: 8026112 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862343E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:57.647862 
| finish at 2025-09-10 11:50:31 + [2025-09-10 05:27:39] iteration 7839/ 11920 | consumed samples: 8027136 | elapsed time per iteration (ms): 6149.7 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851528E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:58:16.994482 | finish at 2025-09-10 12:25:56 + [2025-09-10 05:27:45] iteration 7840/ 11920 | consumed samples: 8028160 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841667E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:52.645569 | finish at 2025-09-10 11:50:38 + [2025-09-10 05:27:51] iteration 7841/ 11920 | consumed samples: 8029184 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834002E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:25.867802 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:27:56] iteration 7842/ 11920 | consumed samples: 8030208 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834729E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:22.895763 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:28:02] iteration 7843/ 11920 | consumed samples: 8031232 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.832735E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:15.173073 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:28:07] iteration 7844/ 11920 | consumed samples: 8032256 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834650E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:02.314507 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:28:13] iteration 7845/ 11920 | consumed samples: 8033280 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832595E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:27.070960 | finish at 2025-09-10 11:49:40 + [2025-09-10 05:28:19] iteration 7846/ 11920 | consumed samples: 8034304 | elapsed time per iteration (ms): 5935.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850498E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:00.744846 | finish at 2025-09-10 12:11:20 + [2025-09-10 05:28:25] iteration 7847/ 11920 | consumed samples: 8035328 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841012E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:55.057004 | finish at 2025-09-10 11:50:20 + [2025-09-10 05:28:30] iteration 7848/ 11920 | consumed samples: 8036352 | 
elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843894E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:17.783459 | finish at 2025-09-10 11:49:48 + [2025-09-10 05:28:36] iteration 7849/ 11920 | consumed samples: 8037376 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843995E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:22.990267 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:28:41] iteration 7850/ 11920 | consumed samples: 8038400 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842981E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:20.857749 | finish at 2025-09-10 11:50:02 + [2025-09-10 05:28:47] iteration 7851/ 11920 | consumed samples: 8039424 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846248E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:11.619290 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:28:53] iteration 7852/ 11920 | consumed samples: 8040448 | elapsed time per iteration (ms): 5858.2 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850942E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:37:11.261676 | finish at 2025-09-10 12:06:04 + [2025-09-10 05:28:59] iteration 7853/ 11920 | consumed samples: 8041472 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836112E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:51.956976 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:29:04] iteration 7854/ 11920 | consumed samples: 8042496 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831501E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:41.604475 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:29:10] iteration 7855/ 11920 | consumed samples: 8043520 | elapsed time per iteration (ms): 5829.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831063E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:54.758992 | finish at 2025-09-10 12:04:05 + [2025-09-10 05:29:16] iteration 7856/ 11920 | consumed samples: 8044544 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829309E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:40.901360 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:29:21] iteration 7857/ 11920 | consumed samples: 8045568 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.841021E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:07.529874 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:29:27] iteration 7858/ 11920 | consumed samples: 8046592 | elapsed time per iteration (ms): 6055.7 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837588E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:58.202331 | finish at 2025-09-10 12:19:26 + [2025-09-10 05:29:33] iteration 7859/ 11920 | consumed samples: 8047616 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846556E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:04.115503 | finish at 2025-09-10 11:49:37 + [2025-09-10 05:29:39] iteration 7860/ 11920 | consumed samples: 8048640 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848188E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:07.283554 | finish at 2025-09-10 11:49:46 + [2025-09-10 05:29:44] iteration 7861/ 11920 | consumed samples: 8049664 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840711E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:57.415678 | finish at 2025-09-10 11:49:42 + [2025-09-10 05:29:50] 
iteration 7862/ 11920 | consumed samples: 8050688 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850737E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:10.149789 | finish at 2025-09-10 11:50:00 + [2025-09-10 05:29:55] iteration 7863/ 11920 | consumed samples: 8051712 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844049E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:46.354227 | finish at 2025-09-10 11:50:42 + [2025-09-10 05:30:01] iteration 7864/ 11920 | consumed samples: 8052736 | elapsed time per iteration (ms): 6043.9 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848803E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:34.136782 | finish at 2025-09-10 12:18:36 + [2025-09-10 05:30:07] iteration 7865/ 11920 | consumed samples: 8053760 | elapsed time per iteration (ms): 5965.3 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844969E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:09.362413 | finish at 2025-09-10 12:13:17 + [2025-09-10 05:30:13] iteration 7866/ 11920 | consumed samples: 8054784 | elapsed time per iteration (ms): 5940.0 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851021E+00 | loss scale: 1.0 | grad norm: 0.228 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:20.817258 | finish at 2025-09-10 12:11:34 + [2025-09-10 05:30:19] iteration 7867/ 11920 | consumed samples: 8055808 | elapsed time per iteration (ms): 6007.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835757E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:45:48.888115 | finish at 2025-09-10 12:16:08 + [2025-09-10 05:30:25] iteration 7868/ 11920 | consumed samples: 8056832 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844348E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:12.074553 | finish at 2025-09-10 11:50:37 + [2025-09-10 05:30:31] iteration 7869/ 11920 | consumed samples: 8057856 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856507E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:30.582352 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:30:36] iteration 7870/ 11920 | consumed samples: 8058880 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849223E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:24.206278 | finish at 2025-09-10 11:50:00 + [2025-09-10 05:30:42] iteration 7871/ 11920 | consumed samples: 8059904 | elapsed time per iteration (ms): 5624.8 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840106E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:34.664470 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:30:47] iteration 7872/ 11920 | consumed samples: 8060928 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849405E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:05.680946 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:30:53] iteration 7873/ 11920 | consumed samples: 8061952 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840518E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:06.226670 | finish at 2025-09-10 11:50:59 + [2025-09-10 05:30:59] iteration 7874/ 11920 | consumed samples: 8062976 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854065E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:44.736738 | finish at 2025-09-10 11:49:43 + [2025-09-10 05:31:04] iteration 7875/ 11920 | consumed samples: 8064000 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840634E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:58.340700 
| finish at 2025-09-10 11:50:03 + [2025-09-10 05:31:10] iteration 7876/ 11920 | consumed samples: 8065024 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851417E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:41.328712 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:31:16] iteration 7877/ 11920 | consumed samples: 8066048 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845913E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:34.028133 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:31:21] iteration 7878/ 11920 | consumed samples: 8067072 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841094E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 12.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:27.200764 | finish at 2025-09-10 11:50:48 + [2025-09-10 05:31:27] iteration 7879/ 11920 | consumed samples: 8068096 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840550E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:27.822078 | finish at 2025-09-10 11:49:55 + [2025-09-10 05:31:33] iteration 7880/ 11920 | consumed samples: 8069120 | elapsed time per iteration (ms): 5844.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.838480E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:31.861172 | finish at 2025-09-10 12:05:05 + [2025-09-10 05:31:38] iteration 7881/ 11920 | consumed samples: 8070144 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850143E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:50.674521 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:31:44] iteration 7882/ 11920 | consumed samples: 8071168 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839471E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:12.822083 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:31:50] iteration 7883/ 11920 | consumed samples: 8072192 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831329E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:03.759419 | finish at 2025-09-10 11:49:53 + [2025-09-10 05:31:55] iteration 7884/ 11920 | consumed samples: 8073216 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834124E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:44.363450 | finish at 2025-09-10 11:50:40 + [2025-09-10 05:32:01] iteration 7885/ 11920 | consumed samples: 8074240 | 
elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843898E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:10.711344 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:32:06] iteration 7886/ 11920 | consumed samples: 8075264 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845320E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:59.428755 | finish at 2025-09-10 11:50:06 + [2025-09-10 05:32:12] iteration 7887/ 11920 | consumed samples: 8076288 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842088E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:49.855709 | finish at 2025-09-10 11:50:02 + [2025-09-10 05:32:18] iteration 7888/ 11920 | consumed samples: 8077312 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819533E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:53.373734 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:32:23] iteration 7889/ 11920 | consumed samples: 8078336 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836021E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:17:46.497148 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:32:29] iteration 7890/ 11920 | consumed samples: 8079360 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843752E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:28.554380 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:32:35] iteration 7891/ 11920 | consumed samples: 8080384 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860523E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:22.469466 | finish at 2025-09-10 11:49:57 + [2025-09-10 05:32:40] iteration 7892/ 11920 | consumed samples: 8081408 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853011E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:10.369151 | finish at 2025-09-10 11:49:51 + [2025-09-10 05:32:46] iteration 7893/ 11920 | consumed samples: 8082432 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836067E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:13.475421 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:32:51] iteration 7894/ 11920 | consumed samples: 8083456 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.849926E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:03.605385 | finish at 2025-09-10 11:50:55 + [2025-09-10 05:32:57] iteration 7895/ 11920 | consumed samples: 8084480 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846626E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:37.514572 | finish at 2025-09-10 11:50:35 + [2025-09-10 05:33:03] iteration 7896/ 11920 | consumed samples: 8085504 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842090E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:12.502691 | finish at 2025-09-10 11:50:15 + [2025-09-10 05:33:08] iteration 7897/ 11920 | consumed samples: 8086528 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837832E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:14.730938 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:33:14] iteration 7898/ 11920 | consumed samples: 8087552 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846574E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:58.124978 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:33:20] 
iteration 7899/ 11920 | consumed samples: 8088576 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854714E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:43.341179 | finish at 2025-09-10 11:50:03 + [2025-09-10 05:33:25] iteration 7900/ 11920 | consumed samples: 8089600 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842870E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:51.107383 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:33:31] iteration 7901/ 11920 | consumed samples: 8090624 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848130E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:46.140057 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:33:36] iteration 7902/ 11920 | consumed samples: 8091648 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844474E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:51.659257 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:33:42] iteration 7903/ 11920 | consumed samples: 8092672 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847770E+00 | loss scale: 1.0 | grad norm: 0.259 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:19.829201 | finish at 2025-09-10 11:50:02 + [2025-09-10 05:33:48] iteration 7904/ 11920 | consumed samples: 8093696 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829762E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:13.085957 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:33:53] iteration 7905/ 11920 | consumed samples: 8094720 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852297E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:17.159712 | finish at 2025-09-10 11:51:10 + [2025-09-10 05:33:59] iteration 7906/ 11920 | consumed samples: 8095744 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844307E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:16.418718 | finish at 2025-09-10 11:50:15 + [2025-09-10 05:34:05] iteration 7907/ 11920 | consumed samples: 8096768 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846439E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:39.533630 | finish at 2025-09-10 11:49:44 + [2025-09-10 05:34:10] iteration 7908/ 11920 | consumed samples: 8097792 | elapsed time per iteration (ms): 5619.6 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856636E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:45.972216 | finish at 2025-09-10 11:49:56 + [2025-09-10 05:34:16] iteration 7909/ 11920 | consumed samples: 8098816 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840501E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:33.328581 | finish at 2025-09-10 11:49:49 + [2025-09-10 05:34:22] iteration 7910/ 11920 | consumed samples: 8099840 | elapsed time per iteration (ms): 5874.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826945E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:35.797691 | finish at 2025-09-10 12:06:57 + [2025-09-10 05:34:27] iteration 7911/ 11920 | consumed samples: 8100864 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856897E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:08.497880 | finish at 2025-09-10 11:50:36 + [2025-09-10 05:34:33] iteration 7912/ 11920 | consumed samples: 8101888 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846927E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:38.333862 
| finish at 2025-09-10 11:50:11 + [2025-09-10 05:34:39] iteration 7913/ 11920 | consumed samples: 8102912 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837982E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:32.669446 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:34:44] iteration 7914/ 11920 | consumed samples: 8103936 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845947E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:28.686034 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:34:50] iteration 7915/ 11920 | consumed samples: 8104960 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844948E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:49.777536 | finish at 2025-09-10 11:51:40 + [2025-09-10 05:34:55] iteration 7916/ 11920 | consumed samples: 8105984 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852585E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:53.249522 | finish at 2025-09-10 11:50:49 + [2025-09-10 05:35:01] iteration 7917/ 11920 | consumed samples: 8107008 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.827317E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:27.815168 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:35:07] iteration 7918/ 11920 | consumed samples: 8108032 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839133E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:52.854920 | finish at 2025-09-10 11:50:00 + [2025-09-10 05:35:12] iteration 7919/ 11920 | consumed samples: 8109056 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839739E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:37.351980 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:35:18] iteration 7920/ 11920 | consumed samples: 8110080 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836325E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:46.471176 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:35:24] iteration 7921/ 11920 | consumed samples: 8111104 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835862E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:03.382109 | finish at 2025-09-10 11:50:27 + [2025-09-10 05:35:29] iteration 7922/ 11920 | consumed samples: 8112128 | 
elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852885E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:20.661177 | finish at 2025-09-10 11:49:50 + [2025-09-10 05:35:35] iteration 7923/ 11920 | consumed samples: 8113152 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857434E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:29.355695 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:35:40] iteration 7924/ 11920 | consumed samples: 8114176 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837231E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:58.958127 | finish at 2025-09-10 11:50:39 + [2025-09-10 05:35:46] iteration 7925/ 11920 | consumed samples: 8115200 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823351E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:15:02.656368 | finish at 2025-09-10 11:50:49 + [2025-09-10 05:35:52] iteration 7926/ 11920 | consumed samples: 8116224 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847219E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:14:31.120727 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:35:58] iteration 7927/ 11920 | consumed samples: 8117248 | elapsed time per iteration (ms): 5950.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835158E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:35:59.439285 | finish at 2025-09-10 12:11:57 + [2025-09-10 05:36:03] iteration 7928/ 11920 | consumed samples: 8118272 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826881E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:45.633230 | finish at 2025-09-10 11:49:49 + [2025-09-10 05:36:09] iteration 7929/ 11920 | consumed samples: 8119296 | elapsed time per iteration (ms): 5969.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835677E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:04.906574 | finish at 2025-09-10 12:13:14 + [2025-09-10 05:36:15] iteration 7930/ 11920 | consumed samples: 8120320 | elapsed time per iteration (ms): 5810.6 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.86% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818112E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:24.360759 | finish at 2025-09-10 12:02:39 + [2025-09-10 05:36:21] iteration 7931/ 11920 | consumed samples: 8121344 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.824800E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:53.532371 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:36:26] iteration 7932/ 11920 | consumed samples: 8122368 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842205E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:05.136309 | finish at 2025-09-10 11:50:31 + [2025-09-10 05:36:32] iteration 7933/ 11920 | consumed samples: 8123392 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851729E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:49.720071 | finish at 2025-09-10 11:50:22 + [2025-09-10 05:36:38] iteration 7934/ 11920 | consumed samples: 8124416 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841842E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:20.423377 | finish at 2025-09-10 11:49:58 + [2025-09-10 05:36:43] iteration 7935/ 11920 | consumed samples: 8125440 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839268E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:59.933617 | finish at 2025-09-10 11:49:43 + [2025-09-10 05:36:49] 
iteration 7936/ 11920 | consumed samples: 8126464 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827687E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:36.547382 | finish at 2025-09-10 11:50:25 + [2025-09-10 05:36:54] iteration 7937/ 11920 | consumed samples: 8127488 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846239E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:46.826894 | finish at 2025-09-10 11:50:41 + [2025-09-10 05:37:00] iteration 7938/ 11920 | consumed samples: 8128512 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829444E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:15.600897 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:37:06] iteration 7939/ 11920 | consumed samples: 8129536 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838687E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:19.532674 | finish at 2025-09-10 11:50:25 + [2025-09-10 05:37:11] iteration 7940/ 11920 | consumed samples: 8130560 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843900E+00 | loss scale: 1.0 | grad norm: 0.214 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:04.582114 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:37:17] iteration 7941/ 11920 | consumed samples: 8131584 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845705E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:12.010564 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:37:23] iteration 7942/ 11920 | consumed samples: 8132608 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835493E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:45.829609 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:37:28] iteration 7943/ 11920 | consumed samples: 8133632 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824502E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:55.059688 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:37:34] iteration 7944/ 11920 | consumed samples: 8134656 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835685E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:44.078590 | finish at 2025-09-10 11:50:18 + [2025-09-10 05:37:39] iteration 7945/ 11920 | consumed samples: 8135680 | elapsed time per iteration (ms): 5621.2 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850201E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:24.369847 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:37:45] iteration 7946/ 11920 | consumed samples: 8136704 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834883E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:23.041633 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:37:51] iteration 7947/ 11920 | consumed samples: 8137728 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824406E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:21.008408 | finish at 2025-09-10 11:50:12 + [2025-09-10 05:37:56] iteration 7948/ 11920 | consumed samples: 8138752 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831037E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:30.397968 | finish at 2025-09-10 11:50:27 + [2025-09-10 05:38:02] iteration 7949/ 11920 | consumed samples: 8139776 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827767E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:59.558756 
| finish at 2025-09-10 11:50:01 + [2025-09-10 05:38:08] iteration 7950/ 11920 | consumed samples: 8140800 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839477E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:05.905938 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:38:13] iteration 7951/ 11920 | consumed samples: 8141824 | elapsed time per iteration (ms): 5849.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830243E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:57.138407 | finish at 2025-09-10 12:05:11 + [2025-09-10 05:38:19] iteration 7952/ 11920 | consumed samples: 8142848 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838663E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:59.850525 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:38:25] iteration 7953/ 11920 | consumed samples: 8143872 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842124E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:40.568190 | finish at 2025-09-10 11:51:05 + [2025-09-10 05:38:31] iteration 7954/ 11920 | consumed samples: 8144896 | elapsed time per iteration (ms): 5944.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.825465E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:32:55.770574 | finish at 2025-09-10 12:11:26 + [2025-09-10 05:38:36] iteration 7955/ 11920 | consumed samples: 8145920 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831219E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:18.114413 | finish at 2025-09-10 11:49:54 + [2025-09-10 05:38:42] iteration 7956/ 11920 | consumed samples: 8146944 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838662E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:38.978123 | finish at 2025-09-10 11:50:21 + [2025-09-10 05:38:47] iteration 7957/ 11920 | consumed samples: 8147968 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838172E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:34.236187 | finish at 2025-09-10 11:50:22 + [2025-09-10 05:38:53] iteration 7958/ 11920 | consumed samples: 8148992 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835918E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:52.455493 | finish at 2025-09-10 11:50:46 + [2025-09-10 05:38:59] iteration 7959/ 11920 | consumed samples: 8150016 | 
elapsed time per iteration (ms): 5971.3 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837556E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:34:12.210248 | finish at 2025-09-10 12:13:11 + [2025-09-10 05:39:05] iteration 7960/ 11920 | consumed samples: 8151040 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841133E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:56.757374 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:39:10] iteration 7961/ 11920 | consumed samples: 8152064 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848754E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:55.859308 | finish at 2025-09-10 11:50:06 + [2025-09-10 05:39:16] iteration 7962/ 11920 | consumed samples: 8153088 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839038E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:42.831872 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:39:22] iteration 7963/ 11920 | consumed samples: 8154112 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828075E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:10:23.705180 | finish at 2025-09-10 11:49:45 + [2025-09-10 05:39:27] iteration 7964/ 11920 | consumed samples: 8155136 | elapsed time per iteration (ms): 5960.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836235E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:00.510475 | finish at 2025-09-10 12:12:28 + [2025-09-10 05:39:33] iteration 7965/ 11920 | consumed samples: 8156160 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831584E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:02.435486 | finish at 2025-09-10 11:50:36 + [2025-09-10 05:39:39] iteration 7966/ 11920 | consumed samples: 8157184 | elapsed time per iteration (ms): 6122.5 | throughput per GPU (TFLOP/s/GPU): 73.7 | MFU 7.46% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844198E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:43:28.518435 | finish at 2025-09-10 12:23:08 + [2025-09-10 05:39:45] iteration 7967/ 11920 | consumed samples: 8158208 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848106E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:00.416256 | finish at 2025-09-10 11:49:45 + [2025-09-10 05:39:50] iteration 7968/ 11920 | consumed samples: 8159232 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.831590E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:45.620293 | finish at 2025-09-10 11:50:36 + [2025-09-10 05:39:56] iteration 7969/ 11920 | consumed samples: 8160256 | elapsed time per iteration (ms): 5839.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837794E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:24:31.058901 | finish at 2025-09-10 12:04:27 + [2025-09-10 05:40:02] iteration 7970/ 11920 | consumed samples: 8161280 | elapsed time per iteration (ms): 5829.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834229E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:46.732028 | finish at 2025-09-10 12:03:49 + [2025-09-10 05:40:08] iteration 7971/ 11920 | consumed samples: 8162304 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830750E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:00.492694 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:40:13] iteration 7972/ 11920 | consumed samples: 8163328 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836515E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:28.091575 | finish at 2025-09-10 11:49:41 + [2025-09-10 05:40:19] 
iteration 7973/ 11920 | consumed samples: 8164352 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836529E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:56.799040 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:40:25] iteration 7974/ 11920 | consumed samples: 8165376 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830592E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:35.476644 | finish at 2025-09-10 11:51:00 + [2025-09-10 05:40:30] iteration 7975/ 11920 | consumed samples: 8166400 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832500E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:38.503987 | finish at 2025-09-10 11:50:09 + [2025-09-10 05:40:36] iteration 7976/ 11920 | consumed samples: 8167424 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841337E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:27.657625 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:40:42] iteration 7977/ 11920 | consumed samples: 8168448 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834945E+00 | loss scale: 1.0 | grad norm: 0.183 | num 
zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:36.193754 | finish at 2025-09-10 11:50:18 + [2025-09-10 05:40:48] iteration 7978/ 11920 | consumed samples: 8169472 | elapsed time per iteration (ms): 5982.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839443E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:33:02.565114 | finish at 2025-09-10 12:13:50 + [2025-09-10 05:40:54] iteration 7979/ 11920 | consumed samples: 8170496 | elapsed time per iteration (ms): 6123.6 | throughput per GPU (TFLOP/s/GPU): 73.7 | MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845373E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:12.935676 | finish at 2025-09-10 12:23:07 + [2025-09-10 05:40:59] iteration 7980/ 11920 | consumed samples: 8171520 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829790E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:24.739456 | finish at 2025-09-10 11:50:24 + [2025-09-10 05:41:05] iteration 7981/ 11920 | consumed samples: 8172544 | elapsed time per iteration (ms): 6233.2 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841234E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:49:12.731923 | finish at 2025-09-10 12:30:18 + [2025-09-10 05:41:12] iteration 7982/ 11920 | consumed samples: 8173568 | elapsed time per iteration (ms): 6596.0 | throughput 
per GPU (TFLOP/s/GPU): 68.4 | MFU 6.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830476E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 7:12:55.189600 | finish at 2025-09-10 12:54:07 + [2025-09-10 05:41:18] iteration 7983/ 11920 | consumed samples: 8174592 | elapsed time per iteration (ms): 6128.0 | throughput per GPU (TFLOP/s/GPU): 73.7 | MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837551E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:05.977383 | finish at 2025-09-10 12:23:24 + [2025-09-10 05:41:25] iteration 7984/ 11920 | consumed samples: 8175616 | elapsed time per iteration (ms): 6355.3 | throughput per GPU (TFLOP/s/GPU): 71.0 | MFU 7.18% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825939E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:56:54.363007 | finish at 2025-09-10 12:38:19 + [2025-09-10 05:41:30] iteration 7985/ 11920 | consumed samples: 8176640 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844915E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:47.699870 | finish at 2025-09-10 11:50:18 + [2025-09-10 05:41:36] iteration 7986/ 11920 | consumed samples: 8177664 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833714E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:43.507861 
| finish at 2025-09-10 11:50:19 + [2025-09-10 05:41:41] iteration 7987/ 11920 | consumed samples: 8178688 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819714E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:25.409966 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:41:47] iteration 7988/ 11920 | consumed samples: 8179712 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826548E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:15.087162 | finish at 2025-09-10 11:50:02 + [2025-09-10 05:41:53] iteration 7989/ 11920 | consumed samples: 8180736 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830429E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:47.326066 | finish at 2025-09-10 11:50:40 + [2025-09-10 05:41:59] iteration 7990/ 11920 | consumed samples: 8181760 | elapsed time per iteration (ms): 5905.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830868E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:49.416804 | finish at 2025-09-10 12:08:48 + [2025-09-10 05:42:04] iteration 7991/ 11920 | consumed samples: 8182784 | elapsed time per iteration (ms): 5882.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.846067E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:11.257485 | finish at 2025-09-10 12:07:16 + [2025-09-10 05:42:10] iteration 7992/ 11920 | consumed samples: 8183808 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854359E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:25.151749 | finish at 2025-09-10 11:50:35 + [2025-09-10 05:42:16] iteration 7993/ 11920 | consumed samples: 8184832 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836756E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:00.086268 | finish at 2025-09-10 11:50:16 + [2025-09-10 05:42:21] iteration 7994/ 11920 | consumed samples: 8185856 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847939E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:24.235046 | finish at 2025-09-10 11:50:46 + [2025-09-10 05:42:27] iteration 7995/ 11920 | consumed samples: 8186880 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826126E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:02.953691 | finish at 2025-09-10 11:50:30 + [2025-09-10 05:42:33] iteration 7996/ 11920 | consumed samples: 8187904 | 
elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845231E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:47.584597 | finish at 2025-09-10 11:50:20 + [2025-09-10 05:42:38] iteration 7997/ 11920 | consumed samples: 8188928 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841852E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:53.397894 | finish at 2025-09-10 11:50:32 + [2025-09-10 05:42:44] iteration 7998/ 11920 | consumed samples: 8189952 | elapsed time per iteration (ms): 5836.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846131E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:31.537253 | finish at 2025-09-10 12:04:16 + [2025-09-10 05:42:50] iteration 7999/ 11920 | consumed samples: 8190976 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838216E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:18.283732 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:42:55] iteration 8000/ 11920 | consumed samples: 8192000 | elapsed time per iteration (ms): 5641.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841725E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:08:33.415470 | finish at 2025-09-10 11:51:29 + [2025-09-10 05:43:01] iteration 8001/ 11920 | consumed samples: 8193024 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836534E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:01.641110 | finish at 2025-09-10 11:51:03 + [2025-09-10 05:43:07] iteration 8002/ 11920 | consumed samples: 8194048 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835200E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:59.321807 | finish at 2025-09-10 11:51:06 + [2025-09-10 05:43:12] iteration 8003/ 11920 | consumed samples: 8195072 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844880E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:56.105216 | finish at 2025-09-10 11:51:08 + [2025-09-10 05:43:18] iteration 8004/ 11920 | consumed samples: 8196096 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828065E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:42.778791 | finish at 2025-09-10 11:51:01 + [2025-09-10 05:43:24] iteration 8005/ 11920 | consumed samples: 8197120 | elapsed time per iteration (ms): 5646.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.839924E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:26.127273 | finish at 2025-09-10 11:51:50 + [2025-09-10 05:43:29] iteration 8006/ 11920 | consumed samples: 8198144 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848467E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:56.257465 | finish at 2025-09-10 11:50:25 + [2025-09-10 05:43:35] iteration 8007/ 11920 | consumed samples: 8199168 | elapsed time per iteration (ms): 5837.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843450E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:39.993063 | finish at 2025-09-10 12:04:15 + [2025-09-10 05:43:41] iteration 8008/ 11920 | consumed samples: 8200192 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836621E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:41.539707 | finish at 2025-09-10 11:50:22 + [2025-09-10 05:43:46] iteration 8009/ 11920 | consumed samples: 8201216 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824058E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:27.240030 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:43:52] 
iteration 8010/ 11920 | consumed samples: 8202240 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834185E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:59.461467 | finish at 2025-09-10 11:50:51 + [2025-09-10 05:43:57] iteration 8011/ 11920 | consumed samples: 8203264 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827911E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:29.672084 | finish at 2025-09-10 11:50:27 + [2025-09-10 05:44:03] iteration 8012/ 11920 | consumed samples: 8204288 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833936E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:07.543713 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:44:09] iteration 8013/ 11920 | consumed samples: 8205312 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830155E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:45.863456 | finish at 2025-09-10 11:49:55 + [2025-09-10 05:44:15] iteration 8014/ 11920 | consumed samples: 8206336 | elapsed time per iteration (ms): 6002.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841564E+00 | loss scale: 1.0 | grad norm: 0.159 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:30:47.063404 | finish at 2025-09-10 12:15:02 + [2025-09-10 05:44:21] iteration 8015/ 11920 | consumed samples: 8207360 | elapsed time per iteration (ms): 5827.5 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851291E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:19:16.521977 | finish at 2025-09-10 12:03:37 + [2025-09-10 05:44:26] iteration 8016/ 11920 | consumed samples: 8208384 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829706E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:00.906586 | finish at 2025-09-10 11:50:27 + [2025-09-10 05:44:32] iteration 8017/ 11920 | consumed samples: 8209408 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839033E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:06.327885 | finish at 2025-09-10 11:50:38 + [2025-09-10 05:44:37] iteration 8018/ 11920 | consumed samples: 8210432 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849332E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:50.073830 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:44:43] iteration 8019/ 11920 | consumed samples: 8211456 | elapsed time per iteration (ms): 5619.9 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838894E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:23.256826 | finish at 2025-09-10 11:50:06 + [2025-09-10 05:44:49] iteration 8020/ 11920 | consumed samples: 8212480 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836743E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:50.409794 | finish at 2025-09-10 11:50:39 + [2025-09-10 05:44:55] iteration 8021/ 11920 | consumed samples: 8213504 | elapsed time per iteration (ms): 6193.7 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843075E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:42:29.386258 | finish at 2025-09-10 12:27:24 + [2025-09-10 05:45:00] iteration 8022/ 11920 | consumed samples: 8214528 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831870E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:09.803194 | finish at 2025-09-10 11:50:10 + [2025-09-10 05:45:06] iteration 8023/ 11920 | consumed samples: 8215552 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852497E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:08.659829 
| finish at 2025-09-10 11:50:15 + [2025-09-10 05:45:12] iteration 8024/ 11920 | consumed samples: 8216576 | elapsed time per iteration (ms): 5853.2 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850107E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:20:04.043818 | finish at 2025-09-10 12:05:16 + [2025-09-10 05:45:18] iteration 8025/ 11920 | consumed samples: 8217600 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838826E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:02.185466 | finish at 2025-09-10 11:50:20 + [2025-09-10 05:45:23] iteration 8026/ 11920 | consumed samples: 8218624 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844343E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:59.879492 | finish at 2025-09-10 11:50:23 + [2025-09-10 05:45:29] iteration 8027/ 11920 | consumed samples: 8219648 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842469E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:26.724503 | finish at 2025-09-10 11:50:56 + [2025-09-10 05:45:34] iteration 8028/ 11920 | consumed samples: 8220672 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.839248E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:39.581427 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:45:40] iteration 8029/ 11920 | consumed samples: 8221696 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823153E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:52.990311 | finish at 2025-09-10 11:50:33 + [2025-09-10 05:45:46] iteration 8030/ 11920 | consumed samples: 8222720 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839453E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:21.327484 | finish at 2025-09-10 11:50:07 + [2025-09-10 05:45:51] iteration 8031/ 11920 | consumed samples: 8223744 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836730E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:13.973723 | finish at 2025-09-10 11:50:05 + [2025-09-10 05:45:57] iteration 8032/ 11920 | consumed samples: 8224768 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841747E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:24.351963 | finish at 2025-09-10 11:50:21 + [2025-09-10 05:46:03] iteration 8033/ 11920 | consumed samples: 8225792 | 
elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834796E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:01.611657 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:46:08] iteration 8034/ 11920 | consumed samples: 8226816 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844805E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:59.878232 | finish at 2025-09-10 11:50:08 + [2025-09-10 05:46:14] iteration 8035/ 11920 | consumed samples: 8227840 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837262E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:46.729478 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:46:19] iteration 8036/ 11920 | consumed samples: 8228864 | elapsed time per iteration (ms): 5612.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828253E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:18.196962 | finish at 2025-09-10 11:49:38 + [2025-09-10 05:46:25] iteration 8037/ 11920 | consumed samples: 8229888 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833352E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:03:36.404959 | finish at 2025-09-10 11:50:01 + [2025-09-10 05:46:31] iteration 8038/ 11920 | consumed samples: 8230912 | elapsed time per iteration (ms): 5971.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839650E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:26:22.601063 | finish at 2025-09-10 12:12:54 + [2025-09-10 05:46:37] iteration 8039/ 11920 | consumed samples: 8231936 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835926E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:03.800382 | finish at 2025-09-10 11:50:40 + [2025-09-10 05:46:42] iteration 8040/ 11920 | consumed samples: 8232960 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836519E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:48.324680 | finish at 2025-09-10 11:50:31 + [2025-09-10 05:46:48] iteration 8041/ 11920 | consumed samples: 8233984 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808862E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:25.030954 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:46:54] iteration 8042/ 11920 | consumed samples: 8235008 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.822546E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:49.388469 | finish at 2025-09-10 11:50:43 + [2025-09-10 05:46:59] iteration 8043/ 11920 | consumed samples: 8236032 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824688E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:20.244926 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:47:05] iteration 8044/ 11920 | consumed samples: 8237056 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818929E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:19.399610 | finish at 2025-09-10 11:50:24 + [2025-09-10 05:47:10] iteration 8045/ 11920 | consumed samples: 8238080 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831743E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:53.911238 | finish at 2025-09-10 11:50:04 + [2025-09-10 05:47:16] iteration 8046/ 11920 | consumed samples: 8239104 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846853E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:37.206874 | finish at 2025-09-10 11:50:53 + [2025-09-10 05:47:22] 
iteration 8047/ 11920 | consumed samples: 8240128 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827775E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:12.951217 | finish at 2025-09-10 11:50:35 + [2025-09-10 05:47:27] iteration 8048/ 11920 | consumed samples: 8241152 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838665E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:50.098221 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:47:33] iteration 8049/ 11920 | consumed samples: 8242176 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826624E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:35.684058 | finish at 2025-09-10 11:50:09 + [2025-09-10 05:47:39] iteration 8050/ 11920 | consumed samples: 8243200 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843005E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:38.824761 | finish at 2025-09-10 11:51:17 + [2025-09-10 05:47:44] iteration 8051/ 11920 | consumed samples: 8244224 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826686E+00 | loss scale: 1.0 | grad norm: 0.223 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:28.745982 | finish at 2025-09-10 11:50:13 + [2025-09-10 05:47:50] iteration 8052/ 11920 | consumed samples: 8245248 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840246E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:39.897728 | finish at 2025-09-10 11:50:30 + [2025-09-10 05:47:55] iteration 8053/ 11920 | consumed samples: 8246272 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845114E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:06.412718 | finish at 2025-09-10 11:51:02 + [2025-09-10 05:48:01] iteration 8054/ 11920 | consumed samples: 8247296 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824370E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:35.917986 | finish at 2025-09-10 11:50:37 + [2025-09-10 05:48:07] iteration 8055/ 11920 | consumed samples: 8248320 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825820E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:12.613585 | finish at 2025-09-10 11:50:19 + [2025-09-10 05:48:12] iteration 8056/ 11920 | consumed samples: 8249344 | elapsed time per iteration (ms): 5628.1 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831698E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:26.845425 | finish at 2025-09-10 11:50:39 + [2025-09-10 05:48:18] iteration 8057/ 11920 | consumed samples: 8250368 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840735E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:41.322848 | finish at 2025-09-10 11:49:59 + [2025-09-10 05:48:24] iteration 8058/ 11920 | consumed samples: 8251392 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841239E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:47.318814 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:48:29] iteration 8059/ 11920 | consumed samples: 8252416 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835084E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:41.996321 | finish at 2025-09-10 11:50:11 + [2025-09-10 05:48:35] iteration 8060/ 11920 | consumed samples: 8253440 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843720E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:15.713539 
| finish at 2025-09-10 11:50:51 + [2025-09-10 05:48:40] iteration 8061/ 11920 | consumed samples: 8254464 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832513E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:33.843308 | finish at 2025-09-10 11:50:14 + [2025-09-10 05:48:46] iteration 8062/ 11920 | consumed samples: 8255488 | elapsed time per iteration (ms): 5851.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846752E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:14.297928 | finish at 2025-09-10 12:05:01 + [2025-09-10 05:48:52] iteration 8063/ 11920 | consumed samples: 8256512 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828094E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:36.373538 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:48:58] iteration 8064/ 11920 | consumed samples: 8257536 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831627E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:25.237747 | finish at 2025-09-10 11:51:23 + [2025-09-10 05:49:03] iteration 8065/ 11920 | consumed samples: 8258560 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.827853E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:24.933815 | finish at 2025-09-10 11:50:28 + [2025-09-10 05:49:09] iteration 8066/ 11920 | consumed samples: 8259584 | elapsed time per iteration (ms): 5958.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842627E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:44.383925 | finish at 2025-09-10 12:11:53 + [2025-09-10 05:49:15] iteration 8067/ 11920 | consumed samples: 8260608 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860423E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:14.321052 | finish at 2025-09-10 11:50:29 + [2025-09-10 05:49:20] iteration 8068/ 11920 | consumed samples: 8261632 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848102E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:56.677711 | finish at 2025-09-10 11:50:17 + [2025-09-10 05:49:26] iteration 8069/ 11920 | consumed samples: 8262656 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852463E+00 | loss scale: 1.0 | grad norm: 0.329 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:10.524891 | finish at 2025-09-10 11:50:37 + [2025-09-10 05:49:32] iteration 8070/ 11920 | consumed samples: 8263680 | 
elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848900E+00 | loss scale: 1.0 | grad norm: 0.371 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:17.674973 | finish at 2025-09-10 11:50:49 + [2025-09-10 05:49:37] iteration 8071/ 11920 | consumed samples: 8264704 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861192E+00 | loss scale: 1.0 | grad norm: 0.394 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:09.119784 | finish at 2025-09-10 11:50:46 + [2025-09-10 05:49:43] iteration 8072/ 11920 | consumed samples: 8265728 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843826E+00 | loss scale: 1.0 | grad norm: 0.331 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:18.705633 | finish at 2025-09-10 11:51:02 + [2025-09-10 05:49:49] iteration 8073/ 11920 | consumed samples: 8266752 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853830E+00 | loss scale: 1.0 | grad norm: 0.346 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:48.222275 | finish at 2025-09-10 11:50:37 + [2025-09-10 05:49:54] iteration 8074/ 11920 | consumed samples: 8267776 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861375E+00 | loss scale: 1.0 | grad norm: 0.405 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:01:17.938199 | finish at 2025-09-10 11:51:12 + [2025-09-10 05:50:00] iteration 8075/ 11920 | consumed samples: 8268800 | elapsed time per iteration (ms): 6005.1 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870555E+00 | loss scale: 1.0 | grad norm: 0.435 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:24:49.788306 | finish at 2025-09-10 12:14:50 + [2025-09-10 05:50:06] iteration 8076/ 11920 | consumed samples: 8269824 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870170E+00 | loss scale: 1.0 | grad norm: 0.571 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:49.824882 | finish at 2025-09-10 11:50:56 + [2025-09-10 05:50:12] iteration 8077/ 11920 | consumed samples: 8270848 | elapsed time per iteration (ms): 5910.1 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869723E+00 | loss scale: 1.0 | grad norm: 0.411 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:18:32.579289 | finish at 2025-09-10 12:08:44 + [2025-09-10 05:50:18] iteration 8078/ 11920 | consumed samples: 8271872 | elapsed time per iteration (ms): 6350.5 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893252E+00 | loss scale: 1.0 | grad norm: 0.577 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:46:38.480346 | finish at 2025-09-10 12:36:57 + [2025-09-10 05:50:24] iteration 8079/ 11920 | consumed samples: 8272896 | elapsed time per iteration (ms): 5640.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.897683E+00 | loss scale: 1.0 | grad norm: 0.609 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:04.889697 | finish at 2025-09-10 11:51:29 + [2025-09-10 05:50:29] iteration 8080/ 11920 | consumed samples: 8273920 | elapsed time per iteration (ms): 5649.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892338E+00 | loss scale: 1.0 | grad norm: 1.267 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:32.900391 | finish at 2025-09-10 11:52:02 + [2025-09-10 05:50:35] iteration 8081/ 11920 | consumed samples: 8274944 | elapsed time per iteration (ms): 5660.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947704E+00 | loss scale: 1.0 | grad norm: 1.532 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:12.077473 | finish at 2025-09-10 11:52:47 + [2025-09-10 05:50:41] iteration 8082/ 11920 | consumed samples: 8275968 | elapsed time per iteration (ms): 5686.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.114386E+00 | loss scale: 1.0 | grad norm: 5.105 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:45.489122 | finish at 2025-09-10 11:54:26 + [2025-09-10 05:50:47] iteration 8083/ 11920 | consumed samples: 8276992 | elapsed time per iteration (ms): 6067.4 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.420671E+00 | loss scale: 1.0 | grad norm: 6.446 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:28:00.465518 | finish at 2025-09-10 12:18:47 + [2025-09-10 05:50:53] 
iteration 8084/ 11920 | consumed samples: 8278016 | elapsed time per iteration (ms): 5795.7 | throughput per GPU (TFLOP/s/GPU): 77.9 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.899367E+00 | loss scale: 1.0 | grad norm: 14.461 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:32.175611 | finish at 2025-09-10 12:01:25 + [2025-09-10 05:50:58] iteration 8085/ 11920 | consumed samples: 8279040 | elapsed time per iteration (ms): 5721.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.957827E+00 | loss scale: 1.0 | grad norm: 4.065 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:43.333753 | finish at 2025-09-10 11:56:42 + [2025-09-10 05:51:04] iteration 8086/ 11920 | consumed samples: 8280064 | elapsed time per iteration (ms): 5740.2 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.952352E+00 | loss scale: 1.0 | grad norm: 2.870 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:48.079618 | finish at 2025-09-10 11:57:52 + [2025-09-10 05:51:10] iteration 8087/ 11920 | consumed samples: 8281088 | elapsed time per iteration (ms): 5726.8 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.603717E+00 | loss scale: 1.0 | grad norm: 10.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:50.685359 | finish at 2025-09-10 11:57:00 + [2025-09-10 05:51:16] iteration 8088/ 11920 | consumed samples: 8282112 | elapsed time per iteration (ms): 5779.1 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.90% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.050298E+00 | loss scale: 1.0 | grad norm: 6.194 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:05.321856 | finish at 2025-09-10 12:00:21 + [2025-09-10 05:51:22] iteration 8089/ 11920 | consumed samples: 8283136 | elapsed time per iteration (ms): 6103.8 | throughput per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.092951E+00 | loss scale: 1.0 | grad norm: 5.717 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:43.489961 | finish at 2025-09-10 12:21:05 + [2025-09-10 05:51:27] iteration 8090/ 11920 | consumed samples: 8284160 | elapsed time per iteration (ms): 5787.4 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.888678E+00 | loss scale: 1.0 | grad norm: 2.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:25.904567 | finish at 2025-09-10 12:00:53 + [2025-09-10 05:51:33] iteration 8091/ 11920 | consumed samples: 8285184 | elapsed time per iteration (ms): 5783.7 | throughput per GPU (TFLOP/s/GPU): 78.1 | MFU 7.89% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.364100E+00 | loss scale: 1.0 | grad norm: 4.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:05.881288 | finish at 2025-09-10 12:00:39 + [2025-09-10 05:51:39] iteration 8092/ 11920 | consumed samples: 8286208 | elapsed time per iteration (ms): 6130.4 | throughput per GPU (TFLOP/s/GPU): 73.6 | MFU 7.45% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.842372E+00 | loss scale: 1.0 | grad norm: 1.423 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:31:07.160940 | finish at 2025-09-10 12:22:46 + [2025-09-10 05:51:45] iteration 8093/ 11920 | consumed samples: 8287232 | elapsed time per iteration (ms): 6104.8 | throughput 
per GPU (TFLOP/s/GPU): 74.0 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.439976E+00 | loss scale: 1.0 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:22.943631 | finish at 2025-09-10 12:21:08 + [2025-09-10 05:51:51] iteration 8094/ 11920 | consumed samples: 8288256 | elapsed time per iteration (ms): 5756.5 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.860117E+00 | loss scale: 1.0 | grad norm: 3.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:04.347129 | finish at 2025-09-10 11:58:56 + [2025-09-10 05:51:58] iteration 8095/ 11920 | consumed samples: 8289280 | elapsed time per iteration (ms): 6401.5 | throughput per GPU (TFLOP/s/GPU): 70.5 | MFU 7.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.750600E+00 | loss scale: 1.0 | grad norm: 1.432 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:48:05.849619 | finish at 2025-09-10 12:40:03 + [2025-09-10 05:52:04] iteration 8096/ 11920 | consumed samples: 8290304 | elapsed time per iteration (ms): 6053.5 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.194881E+00 | loss scale: 1.0 | grad norm: 3.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:48.422386 | finish at 2025-09-10 12:17:52 + [2025-09-10 05:52:10] iteration 8097/ 11920 | consumed samples: 8291328 | elapsed time per iteration (ms): 6001.2 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.570009E+00 | loss scale: 1.0 | grad norm: 1.016 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:22.769745 
| finish at 2025-09-10 12:14:32 + [2025-09-10 05:52:16] iteration 8098/ 11920 | consumed samples: 8292352 | elapsed time per iteration (ms): 5981.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.391562E+00 | loss scale: 1.0 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:00.406025 | finish at 2025-09-10 12:13:16 + [2025-09-10 05:52:22] iteration 8099/ 11920 | consumed samples: 8293376 | elapsed time per iteration (ms): 6027.4 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.276737E+00 | loss scale: 1.0 | grad norm: 0.970 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:50.536040 | finish at 2025-09-10 12:16:12 + [2025-09-10 05:52:28] iteration 8100/ 11920 | consumed samples: 8294400 | elapsed time per iteration (ms): 6117.7 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.46% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.198759E+00 | loss scale: 1.0 | grad norm: 1.041 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:29:29.574308 | finish at 2025-09-10 12:21:57 + [2025-09-10 05:52:33] iteration 8101/ 11920 | consumed samples: 8295424 | elapsed time per iteration (ms): 5714.2 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.168704E+00 | loss scale: 1.0 | grad norm: 1.091 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:42.649154 | finish at 2025-09-10 11:56:16 + [2025-09-10 05:52:39] iteration 8102/ 11920 | consumed samples: 8296448 | elapsed time per iteration (ms): 5789.7 | throughput per GPU (TFLOP/s/GPU): 78.0 | MFU 7.88% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
5.136547E+00 | loss scale: 1.0 | grad norm: 1.011 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:25.006448 | finish at 2025-09-10 12:01:04 + [2025-09-10 05:52:45] iteration 8103/ 11920 | consumed samples: 8297472 | elapsed time per iteration (ms): 5725.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.044621E+00 | loss scale: 1.0 | grad norm: 1.277 | num zeros: 31.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:15.310489 | finish at 2025-09-10 11:57:00 + [2025-09-10 05:52:51] iteration 8104/ 11920 | consumed samples: 8298496 | elapsed time per iteration (ms): 5708.0 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.942224E+00 | loss scale: 1.0 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:01.916119 | finish at 2025-09-10 11:55:53 + [2025-09-10 05:52:56] iteration 8105/ 11920 | consumed samples: 8299520 | elapsed time per iteration (ms): 5765.4 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.858489E+00 | loss scale: 1.0 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:34.936165 | finish at 2025-09-10 11:59:31 + [2025-09-10 05:53:02] iteration 8106/ 11920 | consumed samples: 8300544 | elapsed time per iteration (ms): 5757.6 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.869425E+00 | loss scale: 1.0 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:59.570322 | finish at 2025-09-10 11:59:02 + [2025-09-10 05:53:08] iteration 8107/ 11920 | consumed samples: 8301568 
| elapsed time per iteration (ms): 6041.0 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.720147E+00 | loss scale: 1.0 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:23:54.291669 | finish at 2025-09-10 12:17:03 + [2025-09-10 05:53:14] iteration 8108/ 11920 | consumed samples: 8302592 | elapsed time per iteration (ms): 5764.8 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.744294E+00 | loss scale: 1.0 | grad norm: 1.584 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:15.288816 | finish at 2025-09-10 11:59:29 + [2025-09-10 05:53:20] iteration 8109/ 11920 | consumed samples: 8303616 | elapsed time per iteration (ms): 5801.9 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.824565E+00 | loss scale: 1.0 | grad norm: 1.366 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:30.896996 | finish at 2025-09-10 12:01:51 + [2025-09-10 05:53:26] iteration 8110/ 11920 | consumed samples: 8304640 | elapsed time per iteration (ms): 5767.5 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.684846E+00 | loss scale: 1.0 | grad norm: 1.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:14.172893 | finish at 2025-09-10 11:59:40 + [2025-09-10 05:53:31] iteration 8111/ 11920 | consumed samples: 8305664 | elapsed time per iteration (ms): 5748.2 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.599618E+00 | loss scale: 1.0 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:04:55.078830 | finish at 2025-09-10 11:58:26 + [2025-09-10 05:53:37] iteration 8112/ 11920 | consumed samples: 8306688 | elapsed time per iteration (ms): 5811.8 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.742192E+00 | loss scale: 1.0 | grad norm: 1.974 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:08:51.310806 | finish at 2025-09-10 12:02:28 + [2025-09-10 05:53:43] iteration 8113/ 11920 | consumed samples: 8307712 | elapsed time per iteration (ms): 5740.9 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.667939E+00 | loss scale: 1.0 | grad norm: 0.939 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:15.629143 | finish at 2025-09-10 11:57:59 + [2025-09-10 05:53:49] iteration 8114/ 11920 | consumed samples: 8308736 | elapsed time per iteration (ms): 5956.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.503073E+00 | loss scale: 1.0 | grad norm: 0.406 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:17:49.024534 | finish at 2025-09-10 12:11:38 + [2025-09-10 05:53:55] iteration 8115/ 11920 | consumed samples: 8309760 | elapsed time per iteration (ms): 6085.3 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.407332E+00 | loss scale: 1.0 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:25:54.735117 | finish at 2025-09-10 12:19:50 + [2025-09-10 05:54:01] iteration 8116/ 11920 | consumed samples: 8310784 | elapsed time per iteration (ms): 5770.8 | throughput per GPU (TFLOP/s/GPU): 78.2 | MFU 7.91% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 4.376694E+00 | loss scale: 1.0 | grad norm: 0.598 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:52.095517 | finish at 2025-09-10 11:59:53 + [2025-09-10 05:54:06] iteration 8117/ 11920 | consumed samples: 8311808 | elapsed time per iteration (ms): 5762.9 | throughput per GPU (TFLOP/s/GPU): 78.3 | MFU 7.92% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.339375E+00 | loss scale: 1.0 | grad norm: 0.625 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:16.326361 | finish at 2025-09-10 11:59:23 + [2025-09-10 05:54:12] iteration 8118/ 11920 | consumed samples: 8312832 | elapsed time per iteration (ms): 5800.6 | throughput per GPU (TFLOP/s/GPU): 77.8 | MFU 7.87% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.444759E+00 | loss scale: 1.0 | grad norm: 2.832 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:33.962979 | finish at 2025-09-10 12:01:46 + [2025-09-10 05:54:19] iteration 8119/ 11920 | consumed samples: 8313856 | elapsed time per iteration (ms): 6330.7 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.414000E+00 | loss scale: 1.0 | grad norm: 1.292 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:41:03.079209 | finish at 2025-09-10 12:35:22 + [2025-09-10 05:54:24] iteration 8120/ 11920 | consumed samples: 8314880 | elapsed time per iteration (ms): 5758.2 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.305949E+00 | loss scale: 1.0 | grad norm: 0.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:41.182384 | finish at 2025-09-10 11:59:06 + [2025-09-10 05:54:30] 
iteration 8121/ 11920 | consumed samples: 8315904 | elapsed time per iteration (ms): 5723.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.228514E+00 | loss scale: 1.0 | grad norm: 0.478 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:24.965975 | finish at 2025-09-10 11:56:55 + [2025-09-10 05:54:36] iteration 8122/ 11920 | consumed samples: 8316928 | elapsed time per iteration (ms): 5735.5 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.185385E+00 | loss scale: 1.0 | grad norm: 0.823 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:03.320708 | finish at 2025-09-10 11:57:39 + [2025-09-10 05:54:42] iteration 8123/ 11920 | consumed samples: 8317952 | elapsed time per iteration (ms): 5739.6 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.145323E+00 | loss scale: 1.0 | grad norm: 0.517 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:13.092603 | finish at 2025-09-10 11:57:55 + [2025-09-10 05:54:47] iteration 8124/ 11920 | consumed samples: 8318976 | elapsed time per iteration (ms): 5719.4 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.135405E+00 | loss scale: 1.0 | grad norm: 0.587 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:50.720856 | finish at 2025-09-10 11:56:38 + [2025-09-10 05:54:53] iteration 8125/ 11920 | consumed samples: 8320000 | elapsed time per iteration (ms): 5704.0 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.073771E+00 | loss scale: 1.0 | grad norm: 0.513 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:46.716177 | finish at 2025-09-10 11:55:40 + [2025-09-10 05:54:59] iteration 8126/ 11920 | consumed samples: 8321024 | elapsed time per iteration (ms): 5701.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.034896E+00 | loss scale: 1.0 | grad norm: 0.427 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:30.078750 | finish at 2025-09-10 11:55:29 + [2025-09-10 05:55:04] iteration 8127/ 11920 | consumed samples: 8322048 | elapsed time per iteration (ms): 5714.0 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.987488E+00 | loss scale: 1.0 | grad norm: 0.467 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:13.138647 | finish at 2025-09-10 11:56:18 + [2025-09-10 05:55:10] iteration 8128/ 11920 | consumed samples: 8323072 | elapsed time per iteration (ms): 5918.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.019317E+00 | loss scale: 1.0 | grad norm: 0.917 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:14:03.808868 | finish at 2025-09-10 12:09:14 + [2025-09-10 05:55:16] iteration 8129/ 11920 | consumed samples: 8324096 | elapsed time per iteration (ms): 5720.1 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.980737E+00 | loss scale: 1.0 | grad norm: 0.979 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:24.803012 | finish at 2025-09-10 11:56:41 + [2025-09-10 05:55:22] iteration 8130/ 11920 | consumed samples: 8325120 | elapsed time per iteration (ms): 6287.9 | throughput 
per GPU (TFLOP/s/GPU): 71.8 | MFU 7.26% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.955043E+00 | loss scale: 1.0 | grad norm: 0.625 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:37:11.234865 | finish at 2025-09-10 12:32:34 + [2025-09-10 05:55:28] iteration 8131/ 11920 | consumed samples: 8326144 | elapsed time per iteration (ms): 5720.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.923300E+00 | loss scale: 1.0 | grad norm: 0.886 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:14.981698 | finish at 2025-09-10 11:56:43 + [2025-09-10 05:55:34] iteration 8132/ 11920 | consumed samples: 8327168 | elapsed time per iteration (ms): 5737.4 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.918497E+00 | loss scale: 1.0 | grad norm: 1.005 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:02:13.299405 | finish at 2025-09-10 11:57:47 + [2025-09-10 05:55:40] iteration 8133/ 11920 | consumed samples: 8328192 | elapsed time per iteration (ms): 5906.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.867210E+00 | loss scale: 1.0 | grad norm: 0.735 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:46.069470 | finish at 2025-09-10 12:08:26 + [2025-09-10 05:55:45] iteration 8134/ 11920 | consumed samples: 8329216 | elapsed time per iteration (ms): 5726.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.908532E+00 | loss scale: 1.0 | grad norm: 1.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:20.276387 
| finish at 2025-09-10 11:57:06 + [2025-09-10 05:55:51] iteration 8135/ 11920 | consumed samples: 8330240 | elapsed time per iteration (ms): 5721.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.844263E+00 | loss scale: 1.0 | grad norm: 0.665 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:57.509664 | finish at 2025-09-10 11:56:49 + [2025-09-10 05:55:57] iteration 8136/ 11920 | consumed samples: 8331264 | elapsed time per iteration (ms): 5700.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.810905E+00 | loss scale: 1.0 | grad norm: 0.763 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:59:30.305010 | finish at 2025-09-10 11:55:27 + [2025-09-10 05:56:03] iteration 8137/ 11920 | consumed samples: 8332288 | elapsed time per iteration (ms): 5712.3 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.781374E+00 | loss scale: 1.0 | grad norm: 0.621 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:09.598665 | finish at 2025-09-10 11:56:12 + [2025-09-10 05:56:08] iteration 8138/ 11920 | consumed samples: 8333312 | elapsed time per iteration (ms): 5699.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.750370E+00 | loss scale: 1.0 | grad norm: 0.670 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:59:15.334388 | finish at 2025-09-10 11:55:24 + [2025-09-10 05:56:14] iteration 8139/ 11920 | consumed samples: 8334336 | elapsed time per iteration (ms): 5725.5 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.747583E+00 | loss scale: 1.0 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:00:48.187119 | finish at 2025-09-10 11:57:02 + [2025-09-10 05:56:20] iteration 8140/ 11920 | consumed samples: 8335360 | elapsed time per iteration (ms): 5688.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.720690E+00 | loss scale: 1.0 | grad norm: 0.955 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:58:24.186172 | finish at 2025-09-10 11:54:44 + [2025-09-10 05:56:26] iteration 8141/ 11920 | consumed samples: 8336384 | elapsed time per iteration (ms): 6056.5 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.755292E+00 | loss scale: 1.0 | grad norm: 1.792 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:21:27.609747 | finish at 2025-09-10 12:17:53 + [2025-09-10 05:56:32] iteration 8142/ 11920 | consumed samples: 8337408 | elapsed time per iteration (ms): 5936.5 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.687488E+00 | loss scale: 1.0 | grad norm: 0.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:13:47.947750 | finish at 2025-09-10 12:10:20 + [2025-09-10 05:56:38] iteration 8143/ 11920 | consumed samples: 8338432 | elapsed time per iteration (ms): 6072.9 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.650811E+00 | loss scale: 1.0 | grad norm: 0.662 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:22:17.353420 | finish at 2025-09-10 12:18:55 + [2025-09-10 05:56:43] iteration 8144/ 11920 | consumed samples: 8339456 | 
elapsed time per iteration (ms): 5714.1 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.694868E+00 | loss scale: 1.0 | grad norm: 1.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:59:36.511383 | finish at 2025-09-10 11:56:20 + [2025-09-10 05:56:49] iteration 8145/ 11920 | consumed samples: 8340480 | elapsed time per iteration (ms): 5667.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.655188E+00 | loss scale: 1.0 | grad norm: 0.869 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:36.277821 | finish at 2025-09-10 11:53:25 + [2025-09-10 05:56:55] iteration 8146/ 11920 | consumed samples: 8341504 | elapsed time per iteration (ms): 5678.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.655641E+00 | loss scale: 1.0 | grad norm: 1.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:57:12.006650 | finish at 2025-09-10 11:54:07 + [2025-09-10 05:57:00] iteration 8147/ 11920 | consumed samples: 8342528 | elapsed time per iteration (ms): 5675.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.600893E+00 | loss scale: 1.0 | grad norm: 0.597 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:53.776326 | finish at 2025-09-10 11:53:54 + [2025-09-10 05:57:06] iteration 8148/ 11920 | consumed samples: 8343552 | elapsed time per iteration (ms): 5675.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.559013E+00 | loss scale: 1.0 | grad norm: 0.550 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 5:56:49.291489 | finish at 2025-09-10 11:53:55 + [2025-09-10 05:57:12] iteration 8149/ 11920 | consumed samples: 8344576 | elapsed time per iteration (ms): 5675.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.565009E+00 | loss scale: 1.0 | grad norm: 0.914 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:42.481008 | finish at 2025-09-10 11:53:54 + [2025-09-10 05:57:17] iteration 8150/ 11920 | consumed samples: 8345600 | elapsed time per iteration (ms): 5666.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.538349E+00 | loss scale: 1.0 | grad norm: 0.499 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:01.254621 | finish at 2025-09-10 11:53:19 + [2025-09-10 05:57:23] iteration 8151/ 11920 | consumed samples: 8346624 | elapsed time per iteration (ms): 5660.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.511337E+00 | loss scale: 1.0 | grad norm: 0.597 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:35.955002 | finish at 2025-09-10 11:52:59 + [2025-09-10 05:57:29] iteration 8152/ 11920 | consumed samples: 8347648 | elapsed time per iteration (ms): 5665.0 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.512846E+00 | loss scale: 1.0 | grad norm: 0.929 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:45.720755 | finish at 2025-09-10 11:53:15 + [2025-09-10 05:57:34] iteration 8153/ 11920 | consumed samples: 8348672 | elapsed time per iteration (ms): 5664.9 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.469067E+00 | loss scale: 1.0 | grad norm: 0.548 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:39.759374 | finish at 2025-09-10 11:53:14 + [2025-09-10 05:57:40] iteration 8154/ 11920 | consumed samples: 8349696 | elapsed time per iteration (ms): 5672.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.478805E+00 | loss scale: 1.0 | grad norm: 0.861 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:01.184522 | finish at 2025-09-10 11:53:41 + [2025-09-10 05:57:46] iteration 8155/ 11920 | consumed samples: 8350720 | elapsed time per iteration (ms): 5660.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.456841E+00 | loss scale: 1.0 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:09.787220 | finish at 2025-09-10 11:52:56 + [2025-09-10 05:57:51] iteration 8156/ 11920 | consumed samples: 8351744 | elapsed time per iteration (ms): 5657.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.407670E+00 | loss scale: 1.0 | grad norm: 0.659 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:54:53.538738 | finish at 2025-09-10 11:52:45 + [2025-09-10 05:57:57] iteration 8157/ 11920 | consumed samples: 8352768 | elapsed time per iteration (ms): 5686.0 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.511237E+00 | loss scale: 1.0 | grad norm: 3.804 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:36.537733 | finish at 2025-09-10 11:54:34 + [2025-09-10 05:58:03] 
iteration 8158/ 11920 | consumed samples: 8353792 | elapsed time per iteration (ms): 5675.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.432136E+00 | loss scale: 1.0 | grad norm: 0.624 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:50.751723 | finish at 2025-09-10 11:53:54 + [2025-09-10 05:58:09] iteration 8159/ 11920 | consumed samples: 8354816 | elapsed time per iteration (ms): 5678.7 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.401136E+00 | loss scale: 1.0 | grad norm: 0.627 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:57.586105 | finish at 2025-09-10 11:54:06 + [2025-09-10 05:58:14] iteration 8160/ 11920 | consumed samples: 8355840 | elapsed time per iteration (ms): 5664.4 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.403621E+00 | loss scale: 1.0 | grad norm: 0.894 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:54:58.059216 | finish at 2025-09-10 11:53:12 + [2025-09-10 05:58:20] iteration 8161/ 11920 | consumed samples: 8356864 | elapsed time per iteration (ms): 5665.5 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.354025E+00 | loss scale: 1.0 | grad norm: 0.570 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:54:56.734313 | finish at 2025-09-10 11:53:17 + [2025-09-10 05:58:26] iteration 8162/ 11920 | consumed samples: 8357888 | elapsed time per iteration (ms): 5652.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.369093E+00 | loss scale: 1.0 | grad norm: 1.341 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:54:01.412839 | finish at 2025-09-10 11:52:27 + [2025-09-10 05:58:31] iteration 8163/ 11920 | consumed samples: 8358912 | elapsed time per iteration (ms): 5651.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.378566E+00 | loss scale: 1.0 | grad norm: 1.441 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:53.909029 | finish at 2025-09-10 11:52:25 + [2025-09-10 05:58:37] iteration 8164/ 11920 | consumed samples: 8359936 | elapsed time per iteration (ms): 5646.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.367013E+00 | loss scale: 1.0 | grad norm: 0.842 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:29.363045 | finish at 2025-09-10 11:52:06 + [2025-09-10 05:58:42] iteration 8165/ 11920 | consumed samples: 8360960 | elapsed time per iteration (ms): 5647.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.326890E+00 | loss scale: 1.0 | grad norm: 0.795 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:24.370686 | finish at 2025-09-10 11:52:07 + [2025-09-10 05:58:48] iteration 8166/ 11920 | consumed samples: 8361984 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.326867E+00 | loss scale: 1.0 | grad norm: 0.825 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:25.803351 | finish at 2025-09-10 11:52:14 + [2025-09-10 05:58:54] iteration 8167/ 11920 | consumed samples: 8363008 | elapsed time per iteration (ms): 5637.8 | throughput 
per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.274654E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:38.800220 | finish at 2025-09-10 11:51:33 + [2025-09-10 05:58:59] iteration 8168/ 11920 | consumed samples: 8364032 | elapsed time per iteration (ms): 5644.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.280248E+00 | loss scale: 1.0 | grad norm: 0.463 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:59.783194 | finish at 2025-09-10 11:51:59 + [2025-09-10 05:59:05] iteration 8169/ 11920 | consumed samples: 8365056 | elapsed time per iteration (ms): 5651.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.280121E+00 | loss scale: 1.0 | grad norm: 1.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:19.922953 | finish at 2025-09-10 11:52:25 + [2025-09-10 05:59:11] iteration 8170/ 11920 | consumed samples: 8366080 | elapsed time per iteration (ms): 5646.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.269534E+00 | loss scale: 1.0 | grad norm: 1.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:54.657047 | finish at 2025-09-10 11:52:05 + [2025-09-10 05:59:16] iteration 8171/ 11920 | consumed samples: 8367104 | elapsed time per iteration (ms): 5674.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.312337E+00 | loss scale: 1.0 | grad norm: 2.344 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:54:33.604817 
| finish at 2025-09-10 11:53:50 + [2025-09-10 05:59:22] iteration 8172/ 11920 | consumed samples: 8368128 | elapsed time per iteration (ms): 5652.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.265921E+00 | loss scale: 1.0 | grad norm: 0.609 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:05.050501 | finish at 2025-09-10 11:52:27 + [2025-09-10 05:59:28] iteration 8173/ 11920 | consumed samples: 8369152 | elapsed time per iteration (ms): 5889.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.271560E+00 | loss scale: 1.0 | grad norm: 1.240 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:47.137823 | finish at 2025-09-10 12:07:15 + [2025-09-10 05:59:34] iteration 8174/ 11920 | consumed samples: 8370176 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.243103E+00 | loss scale: 1.0 | grad norm: 0.512 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:01.966485 | finish at 2025-09-10 11:51:36 + [2025-09-10 05:59:39] iteration 8175/ 11920 | consumed samples: 8371200 | elapsed time per iteration (ms): 5647.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.222158E+00 | loss scale: 1.0 | grad norm: 0.718 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:30.979632 | finish at 2025-09-10 11:52:10 + [2025-09-10 05:59:45] iteration 8176/ 11920 | consumed samples: 8372224 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.244645E+00 | loss scale: 1.0 | grad norm: 0.859 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:52.103348 | finish at 2025-09-10 11:51:37 + [2025-09-10 05:59:50] iteration 8177/ 11920 | consumed samples: 8373248 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195142E+00 | loss scale: 1.0 | grad norm: 0.376 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:29.550760 | finish at 2025-09-10 11:51:20 + [2025-09-10 05:59:56] iteration 8178/ 11920 | consumed samples: 8374272 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.182388E+00 | loss scale: 1.0 | grad norm: 0.591 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:51.866916 | finish at 2025-09-10 11:51:48 + [2025-09-10 06:00:02] iteration 8179/ 11920 | consumed samples: 8375296 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.196226E+00 | loss scale: 1.0 | grad norm: 1.500 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:27.401886 | finish at 2025-09-10 11:51:29 + [2025-09-10 06:00:07] iteration 8180/ 11920 | consumed samples: 8376320 | elapsed time per iteration (ms): 5663.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.236044E+00 | loss scale: 1.0 | grad norm: 1.418 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:00.626221 | finish at 2025-09-10 11:53:08 + [2025-09-10 06:00:13] iteration 8181/ 11920 | consumed samples: 8377344 | 
elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.198317E+00 | loss scale: 1.0 | grad norm: 0.333 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:12.298559 | finish at 2025-09-10 11:51:25 + [2025-09-10 06:00:19] iteration 8182/ 11920 | consumed samples: 8378368 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176216E+00 | loss scale: 1.0 | grad norm: 0.395 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:18.795662 | finish at 2025-09-10 11:51:37 + [2025-09-10 06:00:24] iteration 8183/ 11920 | consumed samples: 8379392 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167710E+00 | loss scale: 1.0 | grad norm: 0.722 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:08.867474 | finish at 2025-09-10 11:51:33 + [2025-09-10 06:00:30] iteration 8184/ 11920 | consumed samples: 8380416 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.195640E+00 | loss scale: 1.0 | grad norm: 2.324 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:34.461294 | finish at 2025-09-10 11:52:04 + [2025-09-10 06:00:36] iteration 8185/ 11920 | consumed samples: 8381440 | elapsed time per iteration (ms): 5648.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.152760E+00 | loss scale: 1.0 | grad norm: 0.925 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 5:51:38.794785 | finish at 2025-09-10 11:52:14 + [2025-09-10 06:00:41] iteration 8186/ 11920 | consumed samples: 8382464 | elapsed time per iteration (ms): 5647.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.167370E+00 | loss scale: 1.0 | grad norm: 0.738 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:27.733984 | finish at 2025-09-10 11:52:09 + [2025-09-10 06:00:47] iteration 8187/ 11920 | consumed samples: 8383488 | elapsed time per iteration (ms): 5640.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.142978E+00 | loss scale: 1.0 | grad norm: 0.684 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:50:57.650197 | finish at 2025-09-10 11:51:45 + [2025-09-10 06:00:53] iteration 8188/ 11920 | consumed samples: 8384512 | elapsed time per iteration (ms): 5641.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.143310E+00 | loss scale: 1.0 | grad norm: 1.006 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:50:53.564584 | finish at 2025-09-10 11:51:46 + [2025-09-10 06:00:58] iteration 8189/ 11920 | consumed samples: 8385536 | elapsed time per iteration (ms): 5873.9 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.122268E+00 | loss scale: 1.0 | grad norm: 0.437 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:15.584711 | finish at 2025-09-10 12:06:14 + [2025-09-10 06:01:04] iteration 8190/ 11920 | consumed samples: 8386560 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.123730E+00 | loss scale: 1.0 | grad norm: 0.625 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:50:18.801637 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:01:10] iteration 8191/ 11920 | consumed samples: 8387584 | elapsed time per iteration (ms): 5648.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125976E+00 | loss scale: 1.0 | grad norm: 0.808 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:03.139905 | finish at 2025-09-10 11:52:13 + [2025-09-10 06:01:15] iteration 8192/ 11920 | consumed samples: 8388608 | elapsed time per iteration (ms): 5652.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.102702E+00 | loss scale: 1.0 | grad norm: 0.733 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:10.936684 | finish at 2025-09-10 11:52:26 + [2025-09-10 06:01:21] iteration 8193/ 11920 | consumed samples: 8389632 | elapsed time per iteration (ms): 5988.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.116226E+00 | loss scale: 1.0 | grad norm: 1.011 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:11:57.599133 | finish at 2025-09-10 12:13:19 + [2025-09-10 06:01:27] iteration 8194/ 11920 | consumed samples: 8390656 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.100468E+00 | loss scale: 1.0 | grad norm: 0.466 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:50:02.907984 | finish at 2025-09-10 11:51:30 + [2025-09-10 06:01:33] 
iteration 8195/ 11920 | consumed samples: 8391680 | elapsed time per iteration (ms): 5961.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068121E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:05.369639 | finish at 2025-09-10 12:11:38 + [2025-09-10 06:01:39] iteration 8196/ 11920 | consumed samples: 8392704 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.073401E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:58.644021 | finish at 2025-09-10 11:51:37 + [2025-09-10 06:01:44] iteration 8197/ 11920 | consumed samples: 8393728 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059082E+00 | loss scale: 1.0 | grad norm: 0.425 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:50.262503 | finish at 2025-09-10 11:51:34 + [2025-09-10 06:01:50] iteration 8198/ 11920 | consumed samples: 8394752 | elapsed time per iteration (ms): 5645.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.062480E+00 | loss scale: 1.0 | grad norm: 1.390 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:50:11.510768 | finish at 2025-09-10 11:52:01 + [2025-09-10 06:01:55] iteration 8199/ 11920 | consumed samples: 8395776 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.089413E+00 | loss scale: 1.0 | grad norm: 1.146 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:46.588545 | finish at 2025-09-10 11:51:42 + [2025-09-10 06:02:01] iteration 8200/ 11920 | consumed samples: 8396800 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081795E+00 | loss scale: 1.0 | grad norm: 1.072 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:46.868677 | finish at 2025-09-10 11:51:48 + [2025-09-10 06:02:07] iteration 8201/ 11920 | consumed samples: 8397824 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.095323E+00 | loss scale: 1.0 | grad norm: 1.049 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:51.760789 | finish at 2025-09-10 11:51:59 + [2025-09-10 06:02:12] iteration 8202/ 11920 | consumed samples: 8398848 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.059216E+00 | loss scale: 1.0 | grad norm: 0.535 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:19.992043 | finish at 2025-09-10 11:51:32 + [2025-09-10 06:02:18] iteration 8203/ 11920 | consumed samples: 8399872 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.052393E+00 | loss scale: 1.0 | grad norm: 0.732 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:10.139830 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:02:24] iteration 8204/ 11920 | consumed samples: 8400896 | elapsed time per iteration (ms): 5630.9 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.042827E+00 | loss scale: 1.0 | grad norm: 0.532 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:44.253946 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:02:29] iteration 8205/ 11920 | consumed samples: 8401920 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041285E+00 | loss scale: 1.0 | grad norm: 0.669 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:03.475651 | finish at 2025-09-10 11:51:33 + [2025-09-10 06:02:35] iteration 8206/ 11920 | consumed samples: 8402944 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.041593E+00 | loss scale: 1.0 | grad norm: 0.513 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:42.568776 | finish at 2025-09-10 11:51:18 + [2025-09-10 06:02:41] iteration 8207/ 11920 | consumed samples: 8403968 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.031589E+00 | loss scale: 1.0 | grad norm: 0.463 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:53.008794 | finish at 2025-09-10 11:51:34 + [2025-09-10 06:02:46] iteration 8208/ 11920 | consumed samples: 8404992 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027381E+00 | loss scale: 1.0 | grad norm: 0.481 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:18.286957 
| finish at 2025-09-10 11:51:05 + [2025-09-10 06:02:52] iteration 8209/ 11920 | consumed samples: 8406016 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.033051E+00 | loss scale: 1.0 | grad norm: 0.517 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:32.497143 | finish at 2025-09-10 11:51:24 + [2025-09-10 06:02:57] iteration 8210/ 11920 | consumed samples: 8407040 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.023023E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:12.999470 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:03:03] iteration 8211/ 11920 | consumed samples: 8408064 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.016973E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:26.448357 | finish at 2025-09-10 11:51:30 + [2025-09-10 06:03:09] iteration 8212/ 11920 | consumed samples: 8409088 | elapsed time per iteration (ms): 6000.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.017099E+00 | loss scale: 1.0 | grad norm: 1.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:10:50.378111 | finish at 2025-09-10 12:14:00 + [2025-09-10 06:03:15] iteration 8213/ 11920 | consumed samples: 8410112 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.027841E+00 | loss scale: 1.0 | grad norm: 0.951 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:20.409847 | finish at 2025-09-10 11:51:35 + [2025-09-10 06:03:20] iteration 8214/ 11920 | consumed samples: 8411136 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.015822E+00 | loss scale: 1.0 | grad norm: 0.522 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:30.487646 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:03:26] iteration 8215/ 11920 | consumed samples: 8412160 | elapsed time per iteration (ms): 5901.0 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.027378E+00 | loss scale: 1.0 | grad norm: 0.669 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:23.102617 | finish at 2025-09-10 12:07:49 + [2025-09-10 06:03:32] iteration 8216/ 11920 | consumed samples: 8413184 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001565E+00 | loss scale: 1.0 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:48.723915 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:03:38] iteration 8217/ 11920 | consumed samples: 8414208 | elapsed time per iteration (ms): 5649.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.992342E+00 | loss scale: 1.0 | grad norm: 0.368 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:39.311473 | finish at 2025-09-10 11:52:17 + [2025-09-10 06:03:43] iteration 8218/ 11920 | consumed samples: 8415232 | 
elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995452E+00 | loss scale: 1.0 | grad norm: 0.584 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:36.954374 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:03:49] iteration 8219/ 11920 | consumed samples: 8416256 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008357E+00 | loss scale: 1.0 | grad norm: 0.955 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:30.600377 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:03:54] iteration 8220/ 11920 | consumed samples: 8417280 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997269E+00 | loss scale: 1.0 | grad norm: 0.360 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:38.009825 | finish at 2025-09-10 11:50:32 + [2025-09-10 06:04:00] iteration 8221/ 11920 | consumed samples: 8418304 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.994796E+00 | loss scale: 1.0 | grad norm: 0.443 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:49.824108 | finish at 2025-09-10 11:50:50 + [2025-09-10 06:04:06] iteration 8222/ 11920 | consumed samples: 8419328 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997981E+00 | loss scale: 1.0 | grad norm: 0.686 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 5:47:17.551959 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:04:11] iteration 8223/ 11920 | consumed samples: 8420352 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.009480E+00 | loss scale: 1.0 | grad norm: 0.499 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:22.680327 | finish at 2025-09-10 11:51:34 + [2025-09-10 06:04:17] iteration 8224/ 11920 | consumed samples: 8421376 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995221E+00 | loss scale: 1.0 | grad norm: 0.562 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:16.392586 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:04:23] iteration 8225/ 11920 | consumed samples: 8422400 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.008302E+00 | loss scale: 1.0 | grad norm: 0.417 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:18.598567 | finish at 2025-09-10 11:50:41 + [2025-09-10 06:04:29] iteration 8226/ 11920 | consumed samples: 8423424 | elapsed time per iteration (ms): 5928.8 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.980440E+00 | loss scale: 1.0 | grad norm: 0.571 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:01.155809 | finish at 2025-09-10 12:09:30 + [2025-09-10 06:04:34] iteration 8227/ 11920 | consumed samples: 8424448 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.979512E+00 | loss scale: 1.0 | grad norm: 0.747 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:15.408084 | finish at 2025-09-10 11:50:50 + [2025-09-10 06:04:40] iteration 8228/ 11920 | consumed samples: 8425472 | elapsed time per iteration (ms): 5961.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999098E+00 | loss scale: 1.0 | grad norm: 0.608 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:50.138536 | finish at 2025-09-10 12:11:30 + [2025-09-10 06:04:46] iteration 8229/ 11920 | consumed samples: 8426496 | elapsed time per iteration (ms): 5974.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975135E+00 | loss scale: 1.0 | grad norm: 0.527 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:33.226566 | finish at 2025-09-10 12:12:19 + [2025-09-10 06:04:52] iteration 8230/ 11920 | consumed samples: 8427520 | elapsed time per iteration (ms): 5830.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.991594E+00 | loss scale: 1.0 | grad norm: 1.610 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:58:35.975082 | finish at 2025-09-10 12:03:28 + [2025-09-10 06:04:58] iteration 8231/ 11920 | consumed samples: 8428544 | elapsed time per iteration (ms): 5971.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985367E+00 | loss scale: 1.0 | grad norm: 0.460 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:07.309961 | finish at 2025-09-10 12:12:05 + [2025-09-10 06:05:04] 
iteration 8232/ 11920 | consumed samples: 8429568 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966164E+00 | loss scale: 1.0 | grad norm: 0.511 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:45:31.827391 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:05:09] iteration 8233/ 11920 | consumed samples: 8430592 | elapsed time per iteration (ms): 5647.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.971134E+00 | loss scale: 1.0 | grad norm: 0.527 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:02.613936 | finish at 2025-09-10 11:52:12 + [2025-09-10 06:05:15] iteration 8234/ 11920 | consumed samples: 8431616 | elapsed time per iteration (ms): 5954.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967628E+00 | loss scale: 1.0 | grad norm: 0.554 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:05:46.413228 | finish at 2025-09-10 12:11:02 + [2025-09-10 06:05:21] iteration 8235/ 11920 | consumed samples: 8432640 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975263E+00 | loss scale: 1.0 | grad norm: 0.483 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:45:05.414780 | finish at 2025-09-10 11:50:26 + [2025-09-10 06:05:26] iteration 8236/ 11920 | consumed samples: 8433664 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965312E+00 | loss scale: 1.0 | grad norm: 1.116 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:45:53.079200 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:05:32] iteration 8237/ 11920 | consumed samples: 8434688 | elapsed time per iteration (ms): 5968.0 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.989827E+00 | loss scale: 1.0 | grad norm: 0.595 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:20.229815 | finish at 2025-09-10 12:11:53 + [2025-09-10 06:05:38] iteration 8238/ 11920 | consumed samples: 8435712 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.968757E+00 | loss scale: 1.0 | grad norm: 0.472 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:45:16.548746 | finish at 2025-09-10 11:50:55 + [2025-09-10 06:05:44] iteration 8239/ 11920 | consumed samples: 8436736 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.962335E+00 | loss scale: 1.0 | grad norm: 0.643 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:45:07.371461 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:05:50] iteration 8240/ 11920 | consumed samples: 8437760 | elapsed time per iteration (ms): 5944.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.981784E+00 | loss scale: 1.0 | grad norm: 0.738 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:35.465088 | finish at 2025-09-10 12:10:25 + [2025-09-10 06:05:55] iteration 8241/ 11920 | consumed samples: 8438784 | elapsed time per iteration (ms): 5626.8 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953578E+00 | loss scale: 1.0 | grad norm: 0.401 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:45:00.900936 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:06:01] iteration 8242/ 11920 | consumed samples: 8439808 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964019E+00 | loss scale: 1.0 | grad norm: 0.429 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:51.577137 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:06:07] iteration 8243/ 11920 | consumed samples: 8440832 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947003E+00 | loss scale: 1.0 | grad norm: 0.405 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:40.642697 | finish at 2025-09-10 12:09:47 + [2025-09-10 06:06:13] iteration 8244/ 11920 | consumed samples: 8441856 | elapsed time per iteration (ms): 5975.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960075E+00 | loss scale: 1.0 | grad norm: 0.479 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:07.326653 | finish at 2025-09-10 12:12:20 + [2025-09-10 06:06:19] iteration 8245/ 11920 | consumed samples: 8442880 | elapsed time per iteration (ms): 5955.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.944679E+00 | loss scale: 1.0 | grad norm: 0.485 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:04:47.522274 
| finish at 2025-09-10 12:11:06 + [2025-09-10 06:06:24] iteration 8246/ 11920 | consumed samples: 8443904 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.952027E+00 | loss scale: 1.0 | grad norm: 0.463 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:20.399531 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:06:30] iteration 8247/ 11920 | consumed samples: 8444928 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.939538E+00 | loss scale: 1.0 | grad norm: 0.473 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:18.232556 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:06:36] iteration 8248/ 11920 | consumed samples: 8445952 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946210E+00 | loss scale: 1.0 | grad norm: 0.635 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:24.556664 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:06:41] iteration 8249/ 11920 | consumed samples: 8446976 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.945772E+00 | loss scale: 1.0 | grad norm: 0.426 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:21.756944 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:06:47] iteration 8250/ 11920 | consumed samples: 8448000 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.936263E+00 | loss scale: 1.0 | grad norm: 0.418 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:53.177421 | finish at 2025-09-10 11:50:40 + [2025-09-10 06:06:52] iteration 8251/ 11920 | consumed samples: 8449024 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928419E+00 | loss scale: 1.0 | grad norm: 0.426 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:40.924638 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:06:58] iteration 8252/ 11920 | consumed samples: 8450048 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931754E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:55.350059 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:07:04] iteration 8253/ 11920 | consumed samples: 8451072 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930093E+00 | loss scale: 1.0 | grad norm: 0.385 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:00.980649 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:07:10] iteration 8254/ 11920 | consumed samples: 8452096 | elapsed time per iteration (ms): 5864.2 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928832E+00 | loss scale: 1.0 | grad norm: 0.373 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:58:18.244153 | finish at 2025-09-10 12:05:28 + [2025-09-10 06:07:15] iteration 8255/ 11920 | consumed samples: 8453120 | 
elapsed time per iteration (ms): 5916.6 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929100E+00 | loss scale: 1.0 | grad norm: 0.399 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:24.512846 | finish at 2025-09-10 12:08:40 + [2025-09-10 06:07:21] iteration 8256/ 11920 | consumed samples: 8454144 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932729E+00 | loss scale: 1.0 | grad norm: 0.542 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:04.688435 | finish at 2025-09-10 11:50:26 + [2025-09-10 06:07:27] iteration 8257/ 11920 | consumed samples: 8455168 | elapsed time per iteration (ms): 5954.2 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946929E+00 | loss scale: 1.0 | grad norm: 0.754 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:03:30.287107 | finish at 2025-09-10 12:10:57 + [2025-09-10 06:07:33] iteration 8258/ 11920 | consumed samples: 8456192 | elapsed time per iteration (ms): 6165.2 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931203E+00 | loss scale: 1.0 | grad norm: 0.375 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:16:16.833165 | finish at 2025-09-10 12:23:50 + [2025-09-10 06:07:39] iteration 8259/ 11920 | consumed samples: 8457216 | elapsed time per iteration (ms): 6258.3 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931521E+00 | loss scale: 1.0 | grad norm: 0.533 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 6:21:51.455862 | finish at 2025-09-10 12:29:31 + [2025-09-10 06:07:45] iteration 8260/ 11920 | consumed samples: 8458240 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.930750E+00 | loss scale: 1.0 | grad norm: 0.602 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:05.273967 | finish at 2025-09-10 11:50:50 + [2025-09-10 06:07:51] iteration 8261/ 11920 | consumed samples: 8459264 | elapsed time per iteration (ms): 5919.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922151E+00 | loss scale: 1.0 | grad norm: 0.743 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:00.219473 | finish at 2025-09-10 12:08:51 + [2025-09-10 06:07:57] iteration 8262/ 11920 | consumed samples: 8460288 | elapsed time per iteration (ms): 6014.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929439E+00 | loss scale: 1.0 | grad norm: 0.449 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:06:39.039966 | finish at 2025-09-10 12:14:36 + [2025-09-10 06:08:03] iteration 8263/ 11920 | consumed samples: 8461312 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.942031E+00 | loss scale: 1.0 | grad norm: 0.627 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:09.267896 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:08:08] iteration 8264/ 11920 | consumed samples: 8462336 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.945806E+00 | loss scale: 1.0 | grad norm: 0.915 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:46.083471 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:08:14] iteration 8265/ 11920 | consumed samples: 8463360 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924829E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:33.612300 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:08:20] iteration 8266/ 11920 | consumed samples: 8464384 | elapsed time per iteration (ms): 5894.0 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910971E+00 | loss scale: 1.0 | grad norm: 0.313 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:58:56.676195 | finish at 2025-09-10 12:07:16 + [2025-09-10 06:08:25] iteration 8267/ 11920 | consumed samples: 8465408 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913067E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:41.882417 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:08:31] iteration 8268/ 11920 | consumed samples: 8466432 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907447E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:26.679382 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:08:37] 
iteration 8269/ 11920 | consumed samples: 8467456 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921618E+00 | loss scale: 1.0 | grad norm: 0.514 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:58.957323 | finish at 2025-09-10 11:50:36 + [2025-09-10 06:08:42] iteration 8270/ 11920 | consumed samples: 8468480 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920941E+00 | loss scale: 1.0 | grad norm: 0.607 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:39.332252 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:08:48] iteration 8271/ 11920 | consumed samples: 8469504 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907409E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:02.647896 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:08:54] iteration 8272/ 11920 | consumed samples: 8470528 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.920479E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:37.200348 | finish at 2025-09-10 11:50:31 + [2025-09-10 06:08:59] iteration 8273/ 11920 | consumed samples: 8471552 | elapsed time per iteration (ms): 5919.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917572E+00 | loss scale: 1.0 | grad norm: 0.280 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:59:48.315177 | finish at 2025-09-10 12:08:48 + [2025-09-10 06:09:05] iteration 8274/ 11920 | consumed samples: 8472576 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902158E+00 | loss scale: 1.0 | grad norm: 0.336 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:29.678995 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:09:11] iteration 8275/ 11920 | consumed samples: 8473600 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907226E+00 | loss scale: 1.0 | grad norm: 0.444 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:03.293581 | finish at 2025-09-10 11:51:14 + [2025-09-10 06:09:17] iteration 8276/ 11920 | consumed samples: 8474624 | elapsed time per iteration (ms): 6056.7 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913365E+00 | loss scale: 1.0 | grad norm: 0.509 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:50.746825 | finish at 2025-09-10 12:17:08 + [2025-09-10 06:09:22] iteration 8277/ 11920 | consumed samples: 8475648 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904445E+00 | loss scale: 1.0 | grad norm: 0.414 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:12.080543 | finish at 2025-09-10 11:50:34 + [2025-09-10 06:09:28] iteration 8278/ 11920 | consumed samples: 8476672 | elapsed time per iteration (ms): 5637.6 | throughput 
per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902967E+00 | loss scale: 1.0 | grad norm: 0.436 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:12.147683 | finish at 2025-09-10 11:51:40 + [2025-09-10 06:09:34] iteration 8279/ 11920 | consumed samples: 8477696 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895262E+00 | loss scale: 1.0 | grad norm: 0.585 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:56.633911 | finish at 2025-09-10 11:51:30 + [2025-09-10 06:09:39] iteration 8280/ 11920 | consumed samples: 8478720 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904803E+00 | loss scale: 1.0 | grad norm: 0.425 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:36.647491 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:09:45] iteration 8281/ 11920 | consumed samples: 8479744 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.901694E+00 | loss scale: 1.0 | grad norm: 0.479 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:09.728115 | finish at 2025-09-10 11:50:55 + [2025-09-10 06:09:51] iteration 8282/ 11920 | consumed samples: 8480768 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.917146E+00 | loss scale: 1.0 | grad norm: 0.523 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:14.594687 
| finish at 2025-09-10 11:51:05 + [2025-09-10 06:09:56] iteration 8283/ 11920 | consumed samples: 8481792 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898717E+00 | loss scale: 1.0 | grad norm: 0.359 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:00.175759 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:10:02] iteration 8284/ 11920 | consumed samples: 8482816 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889173E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:08.937100 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:10:08] iteration 8285/ 11920 | consumed samples: 8483840 | elapsed time per iteration (ms): 5869.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892025E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:33.818314 | finish at 2025-09-10 12:05:41 + [2025-09-10 06:10:13] iteration 8286/ 11920 | consumed samples: 8484864 | elapsed time per iteration (ms): 5832.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892362E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:53:15.868966 | finish at 2025-09-10 12:03:29 + [2025-09-10 06:10:19] iteration 8287/ 11920 | consumed samples: 8485888 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.897942E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:18.731995 | finish at 2025-09-10 11:50:38 + [2025-09-10 06:10:25] iteration 8288/ 11920 | consumed samples: 8486912 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905829E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:14.736141 | finish at 2025-09-10 11:50:39 + [2025-09-10 06:10:30] iteration 8289/ 11920 | consumed samples: 8487936 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887820E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:59.513023 | finish at 2025-09-10 11:50:30 + [2025-09-10 06:10:36] iteration 8290/ 11920 | consumed samples: 8488960 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890550E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:56.983695 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:10:42] iteration 8291/ 11920 | consumed samples: 8489984 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886424E+00 | loss scale: 1.0 | grad norm: 0.297 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:30.044397 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:10:48] iteration 8292/ 11920 | consumed samples: 8491008 | 
elapsed time per iteration (ms): 5978.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881800E+00 | loss scale: 1.0 | grad norm: 0.367 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:01:30.830577 | finish at 2025-09-10 12:12:18 + [2025-09-10 06:10:53] iteration 8293/ 11920 | consumed samples: 8492032 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877373E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:19.443143 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:10:59] iteration 8294/ 11920 | consumed samples: 8493056 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882317E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:39.293584 | finish at 2025-09-10 11:50:38 + [2025-09-10 06:11:04] iteration 8295/ 11920 | consumed samples: 8494080 | elapsed time per iteration (ms): 5614.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900732E+00 | loss scale: 1.0 | grad norm: 0.403 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:11.399362 | finish at 2025-09-10 11:50:16 + [2025-09-10 06:11:10] iteration 8296/ 11920 | consumed samples: 8495104 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898113E+00 | loss scale: 1.0 | grad norm: 0.620 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 5:39:42.702547 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:11:16] iteration 8297/ 11920 | consumed samples: 8496128 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906852E+00 | loss scale: 1.0 | grad norm: 0.515 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:25.440948 | finish at 2025-09-10 11:51:41 + [2025-09-10 06:11:21] iteration 8298/ 11920 | consumed samples: 8497152 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885339E+00 | loss scale: 1.0 | grad norm: 0.298 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:26.157650 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:11:27] iteration 8299/ 11920 | consumed samples: 8498176 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900291E+00 | loss scale: 1.0 | grad norm: 0.352 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:44.493427 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:11:33] iteration 8300/ 11920 | consumed samples: 8499200 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895348E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:37.602954 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:11:38] iteration 8301/ 11920 | consumed samples: 8500224 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.907211E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:30.185984 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:11:44] iteration 8302/ 11920 | consumed samples: 8501248 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892108E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:36.544836 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:11:49] iteration 8303/ 11920 | consumed samples: 8502272 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890240E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:52.175633 | finish at 2025-09-10 11:50:42 + [2025-09-10 06:11:55] iteration 8304/ 11920 | consumed samples: 8503296 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886953E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:29.462791 | finish at 2025-09-10 11:50:25 + [2025-09-10 06:12:01] iteration 8305/ 11920 | consumed samples: 8504320 | elapsed time per iteration (ms): 5954.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878633E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:58:46.843793 | finish at 2025-09-10 12:10:48 + [2025-09-10 06:12:07] 
iteration 8306/ 11920 | consumed samples: 8505344 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889297E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:43.450885 | finish at 2025-09-10 11:50:50 + [2025-09-10 06:12:12] iteration 8307/ 11920 | consumed samples: 8506368 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888850E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:18.984484 | finish at 2025-09-10 11:51:31 + [2025-09-10 06:12:18] iteration 8308/ 11920 | consumed samples: 8507392 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884313E+00 | loss scale: 1.0 | grad norm: 0.374 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:01.381047 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:12:24] iteration 8309/ 11920 | consumed samples: 8508416 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878949E+00 | loss scale: 1.0 | grad norm: 0.451 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:26.724920 | finish at 2025-09-10 11:50:50 + [2025-09-10 06:12:29] iteration 8310/ 11920 | consumed samples: 8509440 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905183E+00 | loss scale: 1.0 | grad norm: 0.432 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:46.983187 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:12:35] iteration 8311/ 11920 | consumed samples: 8510464 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907568E+00 | loss scale: 1.0 | grad norm: 0.405 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:16.690150 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:12:40] iteration 8312/ 11920 | consumed samples: 8511488 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881079E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:29.406008 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:12:46] iteration 8313/ 11920 | consumed samples: 8512512 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887955E+00 | loss scale: 1.0 | grad norm: 0.324 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:30.587163 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:12:52] iteration 8314/ 11920 | consumed samples: 8513536 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883030E+00 | loss scale: 1.0 | grad norm: 0.347 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:53.151157 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:12:57] iteration 8315/ 11920 | consumed samples: 8514560 | elapsed time per iteration (ms): 5629.1 | throughput 
per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874998E+00 | loss scale: 1.0 | grad norm: 0.327 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:12.984018 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:13:03] iteration 8316/ 11920 | consumed samples: 8515584 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873708E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:43.148667 | finish at 2025-09-10 11:50:46 + [2025-09-10 06:13:09] iteration 8317/ 11920 | consumed samples: 8516608 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886451E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:38:14.116310 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:13:14] iteration 8318/ 11920 | consumed samples: 8517632 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886961E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:56.471941 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:13:20] iteration 8319/ 11920 | consumed samples: 8518656 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872886E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:18.182794 
| finish at 2025-09-10 11:50:38 + [2025-09-10 06:13:25] iteration 8320/ 11920 | consumed samples: 8519680 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876792E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:17.478161 | finish at 2025-09-10 11:50:43 + [2025-09-10 06:13:31] iteration 8321/ 11920 | consumed samples: 8520704 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873119E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:34.744757 | finish at 2025-09-10 11:51:06 + [2025-09-10 06:13:37] iteration 8322/ 11920 | consumed samples: 8521728 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897771E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:34.374516 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:13:42] iteration 8323/ 11920 | consumed samples: 8522752 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875930E+00 | loss scale: 1.0 | grad norm: 0.356 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:37.401704 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:13:48] iteration 8324/ 11920 | consumed samples: 8523776 | elapsed time per iteration (ms): 5947.4 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.883722E+00 | loss scale: 1.0 | grad norm: 0.336 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:26.866167 | finish at 2025-09-10 12:10:15 + [2025-09-10 06:13:54] iteration 8325/ 11920 | consumed samples: 8524800 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881834E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:55.586349 | finish at 2025-09-10 11:50:49 + [2025-09-10 06:14:00] iteration 8326/ 11920 | consumed samples: 8525824 | elapsed time per iteration (ms): 5879.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884805E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:10.778729 | finish at 2025-09-10 12:06:11 + [2025-09-10 06:14:05] iteration 8327/ 11920 | consumed samples: 8526848 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884202E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:30.562318 | finish at 2025-09-10 11:51:36 + [2025-09-10 06:14:11] iteration 8328/ 11920 | consumed samples: 8527872 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892789E+00 | loss scale: 1.0 | grad norm: 0.297 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:42.543846 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:14:17] iteration 8329/ 11920 | consumed samples: 8528896 
| elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864546E+00 | loss scale: 1.0 | grad norm: 0.408 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:20.148190 | finish at 2025-09-10 11:50:37 + [2025-09-10 06:14:22] iteration 8330/ 11920 | consumed samples: 8529920 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872819E+00 | loss scale: 1.0 | grad norm: 0.518 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:24.587348 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:14:28] iteration 8331/ 11920 | consumed samples: 8530944 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867728E+00 | loss scale: 1.0 | grad norm: 0.417 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:46.977436 | finish at 2025-09-10 11:51:15 + [2025-09-10 06:14:34] iteration 8332/ 11920 | consumed samples: 8531968 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883657E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:25.568484 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:14:39] iteration 8333/ 11920 | consumed samples: 8532992 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872037E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 5:36:24.390563 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:14:45] iteration 8334/ 11920 | consumed samples: 8534016 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869663E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:11.053357 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:14:50] iteration 8335/ 11920 | consumed samples: 8535040 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870635E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:42.802838 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:14:56] iteration 8336/ 11920 | consumed samples: 8536064 | elapsed time per iteration (ms): 5612.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.879408E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:16.427734 | finish at 2025-09-10 11:50:12 + [2025-09-10 06:15:02] iteration 8337/ 11920 | consumed samples: 8537088 | elapsed time per iteration (ms): 5901.9 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878379E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:26.566833 | finish at 2025-09-10 12:07:29 + [2025-09-10 06:15:08] iteration 8338/ 11920 | consumed samples: 8538112 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.874753E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:26.590864 | finish at 2025-09-10 11:50:34 + [2025-09-10 06:15:13] iteration 8339/ 11920 | consumed samples: 8539136 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876055E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:53.144072 | finish at 2025-09-10 11:51:06 + [2025-09-10 06:15:19] iteration 8340/ 11920 | consumed samples: 8540160 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884047E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:22.507596 | finish at 2025-09-10 11:50:41 + [2025-09-10 06:15:25] iteration 8341/ 11920 | consumed samples: 8541184 | elapsed time per iteration (ms): 6165.3 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855870E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:07:45.645007 | finish at 2025-09-10 12:23:11 + [2025-09-10 06:15:31] iteration 8342/ 11920 | consumed samples: 8542208 | elapsed time per iteration (ms): 5973.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867583E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:56:11.296075 | finish at 2025-09-10 12:11:42 + [2025-09-10 06:15:37] 
iteration 8343/ 11920 | consumed samples: 8543232 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873473E+00 | loss scale: 1.0 | grad norm: 0.511 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:28.917853 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:15:42] iteration 8344/ 11920 | consumed samples: 8544256 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873455E+00 | loss scale: 1.0 | grad norm: 0.500 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:24.974390 | finish at 2025-09-10 11:51:07 +(min, max) time across ranks (ms): + save-checkpoint ................................: (3879.07, 3879.17) + [2025-09-10 06:15:52] iteration 8345/ 11920 | consumed samples: 8545280 | elapsed time per iteration (ms): 5605.8 | throughput per GPU (TFLOP/s/GPU): 80.5 | MFU 8.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861793E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:00.613657 | finish at 2025-09-10 11:49:52 + [2025-09-10 06:15:57] iteration 8346/ 11920 | consumed samples: 8546304 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862257E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:59.046364 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:16:03] iteration 8347/ 11920 | consumed samples: 8547328 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.854481E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:50.448803 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:16:09] iteration 8348/ 11920 | consumed samples: 8548352 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862611E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:21.318377 | finish at 2025-09-10 11:50:30 + [2025-09-10 06:16:14] iteration 8349/ 11920 | consumed samples: 8549376 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873434E+00 | loss scale: 1.0 | grad norm: 0.398 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:08.963533 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:16:20] iteration 8350/ 11920 | consumed samples: 8550400 | elapsed time per iteration (ms): 5982.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865457E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:55:57.309930 | finish at 2025-09-10 12:12:17 + [2025-09-10 06:16:26] iteration 8351/ 11920 | consumed samples: 8551424 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866585E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:22.595776 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:16:31] 
iteration 8352/ 11920 | consumed samples: 8552448 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862373E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:19.966259 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:16:37] iteration 8353/ 11920 | consumed samples: 8553472 | elapsed time per iteration (ms): 5876.1 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866548E+00 | loss scale: 1.0 | grad norm: 0.423 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:20.030216 | finish at 2025-09-10 12:05:57 + [2025-09-10 06:16:43] iteration 8354/ 11920 | consumed samples: 8554496 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891255E+00 | loss scale: 1.0 | grad norm: 0.792 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:34.475311 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:16:49] iteration 8355/ 11920 | consumed samples: 8555520 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870343E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:56.074758 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:16:54] iteration 8356/ 11920 | consumed samples: 8556544 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874412E+00 | loss scale: 1.0 | grad norm: 0.216 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:49.546185 | finish at 2025-09-10 11:50:44 + [2025-09-10 06:17:00] iteration 8357/ 11920 | consumed samples: 8557568 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855212E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:39.164857 | finish at 2025-09-10 11:50:39 + [2025-09-10 06:17:05] iteration 8358/ 11920 | consumed samples: 8558592 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867259E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:26.232516 | finish at 2025-09-10 11:50:32 + [2025-09-10 06:17:11] iteration 8359/ 11920 | consumed samples: 8559616 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866925E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:50.382184 | finish at 2025-09-10 11:51:01 + [2025-09-10 06:17:17] iteration 8360/ 11920 | consumed samples: 8560640 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870270E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:16.309023 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:17:22] iteration 8361/ 11920 | consumed samples: 8561664 | elapsed time per iteration (ms): 5617.5 | throughput 
per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858830E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:12.507940 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:17:28] iteration 8362/ 11920 | consumed samples: 8562688 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884461E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:15.080761 | finish at 2025-09-10 11:50:43 + [2025-09-10 06:17:33] iteration 8363/ 11920 | consumed samples: 8563712 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880715E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:18.467351 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:17:39] iteration 8364/ 11920 | consumed samples: 8564736 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848082E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:09.363084 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:17:45] iteration 8365/ 11920 | consumed samples: 8565760 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860889E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:01.894906 
| finish at 2025-09-10 11:50:47 + [2025-09-10 06:17:50] iteration 8366/ 11920 | consumed samples: 8566784 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852186E+00 | loss scale: 1.0 | grad norm: 0.326 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:00.969228 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:17:56] iteration 8367/ 11920 | consumed samples: 8567808 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854251E+00 | loss scale: 1.0 | grad norm: 0.533 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:38.886250 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:18:02] iteration 8368/ 11920 | consumed samples: 8568832 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862024E+00 | loss scale: 1.0 | grad norm: 0.355 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:50.165382 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:18:07] iteration 8369/ 11920 | consumed samples: 8569856 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862969E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:37.351077 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:18:13] iteration 8370/ 11920 | consumed samples: 8570880 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.846687E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:19.100554 | finish at 2025-09-10 11:51:32 + [2025-09-10 06:18:18] iteration 8371/ 11920 | consumed samples: 8571904 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860765E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:14.744365 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:18:24] iteration 8372/ 11920 | consumed samples: 8572928 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863188E+00 | loss scale: 1.0 | grad norm: 0.491 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:22.672059 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:18:30] iteration 8373/ 11920 | consumed samples: 8573952 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857775E+00 | loss scale: 1.0 | grad norm: 0.606 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:51.588430 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:18:36] iteration 8374/ 11920 | consumed samples: 8574976 | elapsed time per iteration (ms): 5965.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867317E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:52:33.786846 | finish at 2025-09-10 12:11:09 + [2025-09-10 06:18:41] iteration 8375/ 11920 | consumed samples: 8576000 | 
elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868941E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:47.733748 | finish at 2025-09-10 11:51:29 + [2025-09-10 06:18:47] iteration 8376/ 11920 | consumed samples: 8577024 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868674E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:46.019718 | finish at 2025-09-10 11:50:33 + [2025-09-10 06:18:53] iteration 8377/ 11920 | consumed samples: 8578048 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849980E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:33.771022 | finish at 2025-09-10 11:50:26 + [2025-09-10 06:18:58] iteration 8378/ 11920 | consumed samples: 8579072 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861336E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:41.601015 | finish at 2025-09-10 11:50:40 + [2025-09-10 06:19:04] iteration 8379/ 11920 | consumed samples: 8580096 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862265E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 3.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 5:31:43.086550 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:19:09] iteration 8380/ 11920 | consumed samples: 8581120 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858481E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:28.366613 | finish at 2025-09-10 11:50:38 + [2025-09-10 06:19:15] iteration 8381/ 11920 | consumed samples: 8582144 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873703E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:37.218128 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:19:21] iteration 8382/ 11920 | consumed samples: 8583168 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862590E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:23.801687 | finish at 2025-09-10 11:50:44 + [2025-09-10 06:19:27] iteration 8383/ 11920 | consumed samples: 8584192 | elapsed time per iteration (ms): 5881.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856370E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:42.540855 | finish at 2025-09-10 12:06:09 + [2025-09-10 06:19:32] iteration 8384/ 11920 | consumed samples: 8585216 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862220E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:29.413239 | finish at 2025-09-10 11:51:02 + [2025-09-10 06:19:38] iteration 8385/ 11920 | consumed samples: 8586240 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843901E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:50.508379 | finish at 2025-09-10 11:50:28 + [2025-09-10 06:19:43] iteration 8386/ 11920 | consumed samples: 8587264 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861089E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:27.579304 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:19:49] iteration 8387/ 11920 | consumed samples: 8588288 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861863E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:23.061158 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:19:55] iteration 8388/ 11920 | consumed samples: 8589312 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857228E+00 | loss scale: 1.0 | grad norm: 0.330 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:49.816017 | finish at 2025-09-10 11:50:44 + [2025-09-10 
06:20:00] iteration 8389/ 11920 | consumed samples: 8590336 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861167E+00 | loss scale: 1.0 | grad norm: 0.369 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:53.396668 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:20:06] iteration 8390/ 11920 | consumed samples: 8591360 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850244E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:30.445163 | finish at 2025-09-10 11:50:36 + [2025-09-10 06:20:12] iteration 8391/ 11920 | consumed samples: 8592384 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863247E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:37.081320 | finish at 2025-09-10 11:50:49 + [2025-09-10 06:20:17] iteration 8392/ 11920 | consumed samples: 8593408 | elapsed time per iteration (ms): 5946.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858699E+00 | loss scale: 1.0 | grad norm: 0.375 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:40.208599 | finish at 2025-09-10 12:09:58 + [2025-09-10 06:20:23] iteration 8393/ 11920 | consumed samples: 8594432 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856089E+00 | loss scale: 1.0 | grad norm: 0.398 
| num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:14.539791 | finish at 2025-09-10 11:50:38 + [2025-09-10 06:20:29] iteration 8394/ 11920 | consumed samples: 8595456 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853318E+00 | loss scale: 1.0 | grad norm: 0.299 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:27.801462 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:20:34] iteration 8395/ 11920 | consumed samples: 8596480 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860920E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:30.560553 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:20:40] iteration 8396/ 11920 | consumed samples: 8597504 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853417E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:40.136367 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:20:46] iteration 8397/ 11920 | consumed samples: 8598528 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855550E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:45.351727 | finish at 2025-09-10 11:50:31 + [2025-09-10 06:20:51] iteration 8398/ 11920 | consumed samples: 8599552 | elapsed time per iteration (ms): 5621.9 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849685E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:30:00.239719 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:20:57] iteration 8399/ 11920 | consumed samples: 8600576 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855636E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:34.161596 | finish at 2025-09-10 11:50:31 + [2025-09-10 06:21:03] iteration 8400/ 11920 | consumed samples: 8601600 | elapsed time per iteration (ms): 6301.7 | throughput per GPU (TFLOP/s/GPU): 71.6 | MFU 7.24% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850334E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:09:42.026367 | finish at 2025-09-10 12:30:45 + [2025-09-10 06:21:09] iteration 8401/ 11920 | consumed samples: 8602624 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850067E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:46.329877 | finish at 2025-09-10 11:50:55 + [2025-09-10 06:21:15] iteration 8402/ 11920 | consumed samples: 8603648 | elapsed time per iteration (ms): 5953.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844964E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 
5:49:03.730366 | finish at 2025-09-10 12:10:18 + [2025-09-10 06:21:20] iteration 8403/ 11920 | consumed samples: 8604672 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842604E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:20.895046 | finish at 2025-09-10 11:50:41 + [2025-09-10 06:21:26] iteration 8404/ 11920 | consumed samples: 8605696 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852151E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:32.302666 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:21:32] iteration 8405/ 11920 | consumed samples: 8606720 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859340E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:47.732418 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:21:37] iteration 8406/ 11920 | consumed samples: 8607744 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858058E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:05.473326 | finish at 2025-09-10 11:50:43 + [2025-09-10 06:21:43] iteration 8407/ 11920 | consumed samples: 8608768 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 
| lm loss: 2.853878E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:29:33.884483 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:21:49] iteration 8408/ 11920 | consumed samples: 8609792 | elapsed time per iteration (ms): 5936.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858167E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:27.381426 | finish at 2025-09-10 12:09:16 + [2025-09-10 06:21:55] iteration 8409/ 11920 | consumed samples: 8610816 | elapsed time per iteration (ms): 5951.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858019E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:15.609976 | finish at 2025-09-10 12:10:10 + [2025-09-10 06:22:00] iteration 8410/ 11920 | consumed samples: 8611840 | elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861995E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:27.288480 | finish at 2025-09-10 11:50:28 + [2025-09-10 06:22:06] iteration 8411/ 11920 | consumed samples: 8612864 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850936E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:54.202132 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:22:12] iteration 8412/ 11920 | consumed 
samples: 8613888 | elapsed time per iteration (ms): 6374.3 | throughput per GPU (TFLOP/s/GPU): 70.8 | MFU 7.16% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846586E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 6:12:40.947392 | finish at 2025-09-10 12:34:53 + [2025-09-10 06:22:18] iteration 8413/ 11920 | consumed samples: 8614912 | elapsed time per iteration (ms): 5994.4 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871865E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:50:22.486306 | finish at 2025-09-10 12:12:41 + [2025-09-10 06:22:24] iteration 8414/ 11920 | consumed samples: 8615936 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840871E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:35.166352 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:22:30] iteration 8415/ 11920 | consumed samples: 8616960 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854404E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:29.453672 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:22:35] iteration 8416/ 11920 | consumed samples: 8617984 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856251E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:35.931473 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:22:41] iteration 8417/ 11920 | consumed samples: 8619008 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837737E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:30.143590 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:22:46] iteration 8418/ 11920 | consumed samples: 8620032 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849580E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:24.639680 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:22:52] iteration 8419/ 11920 | consumed samples: 8621056 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852784E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:52.450230 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:22:58] iteration 8420/ 11920 | consumed samples: 8622080 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859359E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:51.001792 | finish at 2025-09-10 11:50:49 + [2025-09-10 06:23:03] iteration 8421/ 11920 | consumed samples: 8623104 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 
8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848707E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:39.599481 | finish at 2025-09-10 11:50:43 + [2025-09-10 06:23:09] iteration 8422/ 11920 | consumed samples: 8624128 | elapsed time per iteration (ms): 5844.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851749E+00 | loss scale: 1.0 | grad norm: 0.109 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:43.420724 | finish at 2025-09-10 12:03:53 + [2025-09-10 06:23:15] iteration 8423/ 11920 | consumed samples: 8625152 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830181E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:39.464426 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:23:20] iteration 8424/ 11920 | consumed samples: 8626176 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859875E+00 | loss scale: 1.0 | grad norm: 0.101 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:14.515156 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:23:26] iteration 8425/ 11920 | consumed samples: 8627200 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849627E+00 | loss scale: 1.0 | grad norm: 0.103 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:23.933619 | finish at 2025-09-10 11:50:50 + 
[2025-09-10 06:23:32] iteration 8426/ 11920 | consumed samples: 8628224 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841590E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:24.321715 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:23:37] iteration 8427/ 11920 | consumed samples: 8629248 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858385E+00 | loss scale: 1.0 | grad norm: 0.106 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:37.491458 | finish at 2025-09-10 11:51:15 + [2025-09-10 06:23:43] iteration 8428/ 11920 | consumed samples: 8630272 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835718E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:46.893940 | finish at 2025-09-10 11:51:30 + [2025-09-10 06:23:48] iteration 8429/ 11920 | consumed samples: 8631296 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844605E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:55.404492 | finish at 2025-09-10 11:50:44 + [2025-09-10 06:23:54] iteration 8430/ 11920 | consumed samples: 8632320 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838587E+00 | loss scale: 1.0 | grad 
norm: 0.124 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:55.264895 | finish at 2025-09-10 11:50:49 + [2025-09-10 06:24:00] iteration 8431/ 11920 | consumed samples: 8633344 | elapsed time per iteration (ms): 5879.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851458E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:54.260632 | finish at 2025-09-10 12:05:54 + [2025-09-10 06:24:06] iteration 8432/ 11920 | consumed samples: 8634368 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835283E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:54.199562 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:24:11] iteration 8433/ 11920 | consumed samples: 8635392 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836604E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:40.703193 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:24:17] iteration 8434/ 11920 | consumed samples: 8636416 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847747E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:05.143152 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:24:23] iteration 8435/ 11920 | consumed samples: 8637440 | elapsed time per iteration (ms): 
5929.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844639E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:44:24.052776 | finish at 2025-09-10 12:08:47 + [2025-09-10 06:24:29] iteration 8436/ 11920 | consumed samples: 8638464 | elapsed time per iteration (ms): 5992.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844167E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:47:58.813021 | finish at 2025-09-10 12:12:28 + [2025-09-10 06:24:35] iteration 8437/ 11920 | consumed samples: 8639488 | elapsed time per iteration (ms): 6006.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838323E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:48:40.568935 | finish at 2025-09-10 12:13:15 + [2025-09-10 06:24:40] iteration 8438/ 11920 | consumed samples: 8640512 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842250E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:03.946559 | finish at 2025-09-10 11:50:44 + [2025-09-10 06:24:46] iteration 8439/ 11920 | consumed samples: 8641536 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836689E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 5:25:58.133759 | finish at 2025-09-10 11:50:44 + [2025-09-10 06:24:52] iteration 8440/ 11920 | consumed samples: 8642560 | elapsed time per iteration (ms): 5893.5 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856880E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:41:49.551229 | finish at 2025-09-10 12:06:41 + [2025-09-10 06:24:58] iteration 8441/ 11920 | consumed samples: 8643584 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860999E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:37.464083 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:25:03] iteration 8442/ 11920 | consumed samples: 8644608 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849576E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:10.948457 | finish at 2025-09-10 11:51:14 + [2025-09-10 06:25:09] iteration 8443/ 11920 | consumed samples: 8645632 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855684E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:38.386953 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:25:14] iteration 8444/ 11920 | consumed samples: 8646656 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch 
size: 1024 | lm loss: 2.850228E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:57.784894 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:25:20] iteration 8445/ 11920 | consumed samples: 8647680 | elapsed time per iteration (ms): 5936.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847623E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:43:47.578837 | finish at 2025-09-10 12:09:08 + [2025-09-10 06:25:26] iteration 8446/ 11920 | consumed samples: 8648704 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835369E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:33.357452 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:25:32] iteration 8447/ 11920 | consumed samples: 8649728 | elapsed time per iteration (ms): 5979.3 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843590E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:06.280379 | finish at 2025-09-10 12:11:38 + [2025-09-10 06:25:38] iteration 8448/ 11920 | consumed samples: 8650752 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846393E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:19.554127 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:25:43] iteration 8449/ 11920 | 
consumed samples: 8651776 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841810E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:23.665789 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:25:49] iteration 8450/ 11920 | consumed samples: 8652800 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854409E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:08.244786 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:25:54] iteration 8451/ 11920 | consumed samples: 8653824 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841084E+00 | loss scale: 1.0 | grad norm: 0.104 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:07.381798 | finish at 2025-09-10 11:51:02 + [2025-09-10 06:26:00] iteration 8452/ 11920 | consumed samples: 8654848 | elapsed time per iteration (ms): 5996.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859257E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:46:36.139043 | finish at 2025-09-10 12:12:37 + [2025-09-10 06:26:06] iteration 8453/ 11920 | consumed samples: 8655872 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838262E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 11.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:41.076988 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:26:12] iteration 8454/ 11920 | consumed samples: 8656896 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838933E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:40.696275 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:26:17] iteration 8455/ 11920 | consumed samples: 8657920 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843921E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:35.130286 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:26:23] iteration 8456/ 11920 | consumed samples: 8658944 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840044E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:19.329937 | finish at 2025-09-10 11:50:42 + [2025-09-10 06:26:29] iteration 8457/ 11920 | consumed samples: 8659968 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829066E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:16.046442 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:26:35] iteration 8458/ 11920 | consumed samples: 8660992 | elapsed time per iteration (ms): 6191.0 | throughput per GPU (TFLOP/s/GPU): 
72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845484E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:57:13.288170 | finish at 2025-09-10 12:23:48 + [2025-09-10 06:26:40] iteration 8459/ 11920 | consumed samples: 8662016 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850648E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:14.850543 | finish at 2025-09-10 11:50:55 + [2025-09-10 06:26:46] iteration 8460/ 11920 | consumed samples: 8663040 | elapsed time per iteration (ms): 5613.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837606E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:42.637815 | finish at 2025-09-10 11:50:29 + [2025-09-10 06:26:52] iteration 8461/ 11920 | consumed samples: 8664064 | elapsed time per iteration (ms): 5893.0 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851738E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:43.988872 | finish at 2025-09-10 12:06:36 + [2025-09-10 06:26:57] iteration 8462/ 11920 | consumed samples: 8665088 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846668E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:06.368423 | finish at 2025-09-10 
11:51:04 + [2025-09-10 06:27:03] iteration 8463/ 11920 | consumed samples: 8666112 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834740E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:54.714068 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:27:09] iteration 8464/ 11920 | consumed samples: 8667136 | elapsed time per iteration (ms): 5862.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839661E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:39.448517 | finish at 2025-09-10 12:04:48 + [2025-09-10 06:27:15] iteration 8465/ 11920 | consumed samples: 8668160 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831590E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:04.787025 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:27:20] iteration 8466/ 11920 | consumed samples: 8669184 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839743E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:27.984680 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:27:26] iteration 8467/ 11920 | consumed samples: 8670208 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839263E+00 | loss scale: 
1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:37.133318 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:27:31] iteration 8468/ 11920 | consumed samples: 8671232 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858371E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:55.921678 | finish at 2025-09-10 11:51:27 + [2025-09-10 06:27:37] iteration 8469/ 11920 | consumed samples: 8672256 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842345E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:11.226450 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:27:43] iteration 8470/ 11920 | consumed samples: 8673280 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840344E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:09.087274 | finish at 2025-09-10 11:51:52 + [2025-09-10 06:27:48] iteration 8471/ 11920 | consumed samples: 8674304 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849423E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:32.692336 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:27:54] iteration 8472/ 11920 | consumed samples: 8675328 | elapsed time per iteration 
(ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841978E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:00.349117 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:28:00] iteration 8473/ 11920 | consumed samples: 8676352 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823884E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:47.132202 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:28:05] iteration 8474/ 11920 | consumed samples: 8677376 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844354E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:36.797726 | finish at 2025-09-10 11:50:42 + [2025-09-10 06:28:11] iteration 8475/ 11920 | consumed samples: 8678400 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849767E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:40.323838 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:28:16] iteration 8476/ 11920 | consumed samples: 8679424 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864511E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:22:52.629736 | finish at 2025-09-10 11:51:09 + [2025-09-10 06:28:22] iteration 8477/ 11920 | consumed samples: 8680448 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851518E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:13.970219 | finish at 2025-09-10 11:50:36 + [2025-09-10 06:28:28] iteration 8478/ 11920 | consumed samples: 8681472 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848277E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:20.377925 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:28:34] iteration 8479/ 11920 | consumed samples: 8682496 | elapsed time per iteration (ms): 5857.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847052E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:53.811866 | finish at 2025-09-10 12:04:27 + [2025-09-10 06:28:39] iteration 8480/ 11920 | consumed samples: 8683520 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850784E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:26.346188 | finish at 2025-09-10 11:51:06 + [2025-09-10 06:28:45] iteration 8481/ 11920 | consumed samples: 8684544 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.858700E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:07.213223 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:28:50] iteration 8482/ 11920 | consumed samples: 8685568 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816321E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:51.534883 | finish at 2025-09-10 11:50:42 + [2025-09-10 06:28:56] iteration 8483/ 11920 | consumed samples: 8686592 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842872E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:03.345747 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:29:02] iteration 8484/ 11920 | consumed samples: 8687616 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831840E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:50.335991 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:29:07] iteration 8485/ 11920 | consumed samples: 8688640 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844080E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:32.311081 | finish at 2025-09-10 11:50:40 + [2025-09-10 06:29:13] iteration 8486/ 11920 | 
consumed samples: 8689664 | elapsed time per iteration (ms): 5949.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832610E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:40:28.714129 | finish at 2025-09-10 12:09:42 + [2025-09-10 06:29:19] iteration 8487/ 11920 | consumed samples: 8690688 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836801E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:40.670513 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:29:24] iteration 8488/ 11920 | consumed samples: 8691712 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822892E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:45.321573 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:29:30] iteration 8489/ 11920 | consumed samples: 8692736 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834846E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:25.095745 | finish at 2025-09-10 11:50:55 + [2025-09-10 06:29:36] iteration 8490/ 11920 | consumed samples: 8693760 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843549E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:44.039254 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:29:41] iteration 8491/ 11920 | consumed samples: 8694784 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849010E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:33.953223 | finish at 2025-09-10 11:51:15 + [2025-09-10 06:29:47] iteration 8492/ 11920 | consumed samples: 8695808 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851592E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:26.021741 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:29:53] iteration 8493/ 11920 | consumed samples: 8696832 | elapsed time per iteration (ms): 5842.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836654E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:42.974586 | finish at 2025-09-10 12:03:36 + [2025-09-10 06:29:58] iteration 8494/ 11920 | consumed samples: 8697856 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858979E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:54.928262 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:30:04] iteration 8495/ 11920 | consumed samples: 8698880 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846530E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:58.479077 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:30:10] iteration 8496/ 11920 | consumed samples: 8699904 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841173E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:38.492569 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:30:15] iteration 8497/ 11920 | consumed samples: 8700928 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855121E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:04.875030 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:30:21] iteration 8498/ 11920 | consumed samples: 8701952 | elapsed time per iteration (ms): 5841.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833571E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:08.206593 | finish at 2025-09-10 12:03:29 + [2025-09-10 06:30:27] iteration 8499/ 11920 | consumed samples: 8702976 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837495E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:21.799539 | finish at 2025-09-10 
11:50:49 + [2025-09-10 06:30:33] iteration 8500/ 11920 | consumed samples: 8704000 | elapsed time per iteration (ms): 5859.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844524E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:00.245633 | finish at 2025-09-10 12:04:33 + [2025-09-10 06:30:39] iteration 8501/ 11920 | consumed samples: 8705024 | elapsed time per iteration (ms): 5960.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836133E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:39:40.095612 | finish at 2025-09-10 12:10:19 + [2025-09-10 06:30:44] iteration 8502/ 11920 | consumed samples: 8706048 | elapsed time per iteration (ms): 5848.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840865E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:08.473134 | finish at 2025-09-10 12:03:53 + [2025-09-10 06:30:50] iteration 8503/ 11920 | consumed samples: 8707072 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842345E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:18.033515 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:30:56] iteration 8504/ 11920 | consumed samples: 8708096 | elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852596E+00 | loss scale: 
1.0 | grad norm: 0.143 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:39.483179 | finish at 2025-09-10 11:50:35 + [2025-09-10 06:31:01] iteration 8505/ 11920 | consumed samples: 8709120 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843014E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:56.750023 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:31:07] iteration 8506/ 11920 | consumed samples: 8710144 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829251E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:45.374830 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:31:13] iteration 8507/ 11920 | consumed samples: 8711168 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838905E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:45.238075 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:31:18] iteration 8508/ 11920 | consumed samples: 8712192 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836194E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:53.468860 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:31:24] iteration 8509/ 11920 | consumed samples: 8713216 | elapsed time per iteration 
(ms): 6151.7 | throughput per GPU (TFLOP/s/GPU): 73.4 | MFU 7.42% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844092E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:49:43.452021 | finish at 2025-09-10 12:21:08 + [2025-09-10 06:31:30] iteration 8510/ 11920 | consumed samples: 8714240 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846539E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:32.537813 | finish at 2025-09-10 11:51:02 + [2025-09-10 06:31:36] iteration 8511/ 11920 | consumed samples: 8715264 | elapsed time per iteration (ms): 5947.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839230E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:56.304327 | finish at 2025-09-10 12:09:32 + [2025-09-10 06:31:42] iteration 8512/ 11920 | consumed samples: 8716288 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848752E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:45.182945 | finish at 2025-09-10 11:51:27 + [2025-09-10 06:31:47] iteration 8513/ 11920 | consumed samples: 8717312 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824595E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:19:05.273951 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:31:53] iteration 8514/ 11920 | consumed samples: 8718336 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839241E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:51.983087 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:31:58] iteration 8515/ 11920 | consumed samples: 8719360 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841514E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:48.375188 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:32:04] iteration 8516/ 11920 | consumed samples: 8720384 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840710E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:55.829525 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:32:10] iteration 8517/ 11920 | consumed samples: 8721408 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844248E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:47.494838 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:32:15] iteration 8518/ 11920 | consumed samples: 8722432 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.830920E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:52.587072 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:32:21] iteration 8519/ 11920 | consumed samples: 8723456 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837646E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:31.564077 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:32:26] iteration 8520/ 11920 | consumed samples: 8724480 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843042E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:29.233379 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:32:32] iteration 8521/ 11920 | consumed samples: 8725504 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854726E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:34.454344 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:32:38] iteration 8522/ 11920 | consumed samples: 8726528 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850271E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:14.820121 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:32:44] iteration 8523/ 11920 | 
consumed samples: 8727552 | elapsed time per iteration (ms): 5943.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829434E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:28.890417 | finish at 2025-09-10 12:09:13 + [2025-09-10 06:32:49] iteration 8524/ 11920 | consumed samples: 8728576 | elapsed time per iteration (ms): 5614.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840365E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:47.553855 | finish at 2025-09-10 11:50:37 + [2025-09-10 06:32:55] iteration 8525/ 11920 | consumed samples: 8729600 | elapsed time per iteration (ms): 5972.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830629E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:37:56.365013 | finish at 2025-09-10 12:10:52 + [2025-09-10 06:33:01] iteration 8526/ 11920 | consumed samples: 8730624 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842696E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:22.276868 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:33:06] iteration 8527/ 11920 | consumed samples: 8731648 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829238E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:55.669200 | finish at 2025-09-10 11:51:02 + [2025-09-10 06:33:12] iteration 8528/ 11920 | consumed samples: 8732672 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834098E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:44.741150 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:33:18] iteration 8529/ 11920 | consumed samples: 8733696 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834895E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:08.316384 | finish at 2025-09-10 11:51:26 + [2025-09-10 06:33:23] iteration 8530/ 11920 | consumed samples: 8734720 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845222E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:41.448369 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:33:29] iteration 8531/ 11920 | consumed samples: 8735744 | elapsed time per iteration (ms): 5871.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841705E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:38.820592 | finish at 2025-09-10 12:05:08 + [2025-09-10 06:33:35] iteration 8532/ 11920 | consumed samples: 8736768 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838745E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:43.078405 | finish at 2025-09-10 11:51:18 + [2025-09-10 06:33:40] iteration 8533/ 11920 | consumed samples: 8737792 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834649E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:08.989778 | finish at 2025-09-10 11:50:49 + [2025-09-10 06:33:46] iteration 8534/ 11920 | consumed samples: 8738816 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837434E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:16.762783 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:33:52] iteration 8535/ 11920 | consumed samples: 8739840 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843374E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:52.835147 | finish at 2025-09-10 11:50:45 + [2025-09-10 06:33:57] iteration 8536/ 11920 | consumed samples: 8740864 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828794E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:09.257137 | finish at 2025-09-10 
11:51:07 + [2025-09-10 06:34:03] iteration 8537/ 11920 | consumed samples: 8741888 | elapsed time per iteration (ms): 5613.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832883E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:29.246527 | finish at 2025-09-10 11:50:32 + [2025-09-10 06:34:09] iteration 8538/ 11920 | consumed samples: 8742912 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845417E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:32.885238 | finish at 2025-09-10 11:50:41 + [2025-09-10 06:34:14] iteration 8539/ 11920 | consumed samples: 8743936 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843584E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:18.312799 | finish at 2025-09-10 11:51:33 + [2025-09-10 06:34:20] iteration 8540/ 11920 | consumed samples: 8744960 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834394E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:37.615781 | finish at 2025-09-10 11:51:57 + [2025-09-10 06:34:25] iteration 8541/ 11920 | consumed samples: 8745984 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825433E+00 | loss scale: 
1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:30.036006 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:34:31] iteration 8542/ 11920 | consumed samples: 8747008 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829156E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:52.610662 | finish at 2025-09-10 11:51:24 + [2025-09-10 06:34:37] iteration 8543/ 11920 | consumed samples: 8748032 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843449E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:59.097233 | finish at 2025-09-10 11:51:36 + [2025-09-10 06:34:43] iteration 8544/ 11920 | consumed samples: 8749056 | elapsed time per iteration (ms): 5930.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836635E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:39.553051 | finish at 2025-09-10 12:08:22 + [2025-09-10 06:34:48] iteration 8545/ 11920 | consumed samples: 8750080 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840184E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:59.771633 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:34:54] iteration 8546/ 11920 | consumed samples: 8751104 | elapsed time per iteration 
(ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838408E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:28.231749 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:35:00] iteration 8547/ 11920 | consumed samples: 8752128 | elapsed time per iteration (ms): 5967.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834551E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:35:29.368160 | finish at 2025-09-10 12:10:29 + [2025-09-10 06:35:05] iteration 8548/ 11920 | consumed samples: 8753152 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830187E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:41.669168 | finish at 2025-09-10 11:50:47 + [2025-09-10 06:35:11] iteration 8549/ 11920 | consumed samples: 8754176 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847152E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:54.593396 | finish at 2025-09-10 11:51:06 + [2025-09-10 06:35:17] iteration 8550/ 11920 | consumed samples: 8755200 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861371E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:16:09.633410 | finish at 2025-09-10 11:51:26 + [2025-09-10 06:35:22] iteration 8551/ 11920 | consumed samples: 8756224 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824460E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:20.144748 | finish at 2025-09-10 11:50:43 + [2025-09-10 06:35:28] iteration 8552/ 11920 | consumed samples: 8757248 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837914E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:26.534355 | finish at 2025-09-10 11:50:55 + [2025-09-10 06:35:34] iteration 8553/ 11920 | consumed samples: 8758272 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832624E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:47.103929 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:35:40] iteration 8554/ 11920 | consumed samples: 8759296 | elapsed time per iteration (ms): 5952.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828652E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:33:54.298455 | finish at 2025-09-10 12:09:34 + [2025-09-10 06:35:45] iteration 8555/ 11920 | consumed samples: 8760320 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.828682E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:36.433396 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:35:51] iteration 8556/ 11920 | consumed samples: 8761344 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846256E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:16.532020 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:35:56] iteration 8557/ 11920 | consumed samples: 8762368 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837237E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:00.490183 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:36:02] iteration 8558/ 11920 | consumed samples: 8763392 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847775E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:57.435859 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:36:08] iteration 8559/ 11920 | consumed samples: 8764416 | elapsed time per iteration (ms): 5852.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851024E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:50.838942 | finish at 2025-09-10 12:03:59 + [2025-09-10 06:36:14] iteration 8560/ 11920 | 
consumed samples: 8765440 | elapsed time per iteration (ms): 5944.5 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845444E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:53.652878 | finish at 2025-09-10 12:09:08 + [2025-09-10 06:36:19] iteration 8561/ 11920 | consumed samples: 8766464 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823828E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:53.663847 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:36:25] iteration 8562/ 11920 | consumed samples: 8767488 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830507E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:23.122487 | finish at 2025-09-10 11:50:48 + [2025-09-10 06:36:31] iteration 8563/ 11920 | consumed samples: 8768512 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836661E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:25.015802 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:36:37] iteration 8564/ 11920 | consumed samples: 8769536 | elapsed time per iteration (ms): 6010.3 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833691E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 2.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:36:10.533730 | finish at 2025-09-10 12:12:47 + [2025-09-10 06:36:43] iteration 8565/ 11920 | consumed samples: 8770560 | elapsed time per iteration (ms): 5922.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843580E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:31:10.069537 | finish at 2025-09-10 12:07:53 + [2025-09-10 06:36:48] iteration 8566/ 11920 | consumed samples: 8771584 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831655E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:15.094000 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:36:54] iteration 8567/ 11920 | consumed samples: 8772608 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837612E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:41.476207 | finish at 2025-09-10 11:51:35 + [2025-09-10 06:37:00] iteration 8568/ 11920 | consumed samples: 8773632 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841460E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:13.145105 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:37:05] iteration 8569/ 11920 | consumed samples: 8774656 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 
80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843627E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:45.365235 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:37:11] iteration 8570/ 11920 | consumed samples: 8775680 | elapsed time per iteration (ms): 5614.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840060E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:27.542431 | finish at 2025-09-10 11:50:38 + [2025-09-10 06:37:16] iteration 8571/ 11920 | consumed samples: 8776704 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848850E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:31.798615 | finish at 2025-09-10 11:51:48 + [2025-09-10 06:37:22] iteration 8572/ 11920 | consumed samples: 8777728 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851139E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:36.595359 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:37:28] iteration 8573/ 11920 | consumed samples: 8778752 | elapsed time per iteration (ms): 5821.0 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839099E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:24:42.797159 | finish at 2025-09-10 
12:02:11 + [2025-09-10 06:37:34] iteration 8574/ 11920 | consumed samples: 8779776 | elapsed time per iteration (ms): 6303.0 | throughput per GPU (TFLOP/s/GPU): 71.6 | MFU 7.24% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845663E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:51:29.778877 | finish at 2025-09-10 12:29:04 + [2025-09-10 06:37:40] iteration 8575/ 11920 | consumed samples: 8780800 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835784E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:53.324143 | finish at 2025-09-10 11:51:33 + [2025-09-10 06:37:45] iteration 8576/ 11920 | consumed samples: 8781824 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839069E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:32.678833 | finish at 2025-09-10 11:51:18 + [2025-09-10 06:37:51] iteration 8577/ 11920 | consumed samples: 8782848 | elapsed time per iteration (ms): 5880.0 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846302E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:36.975081 | finish at 2025-09-10 12:05:28 + [2025-09-10 06:37:57] iteration 8578/ 11920 | consumed samples: 8783872 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841779E+00 | loss scale: 
1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:36.604580 | finish at 2025-09-10 11:51:34 + [2025-09-10 06:38:03] iteration 8579/ 11920 | consumed samples: 8784896 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834310E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:53.821250 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:38:08] iteration 8580/ 11920 | consumed samples: 8785920 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839646E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:49.611511 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:38:14] iteration 8581/ 11920 | consumed samples: 8786944 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834396E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:48.873427 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:38:19] iteration 8582/ 11920 | consumed samples: 8787968 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846972E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:57.771648 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:38:25] iteration 8583/ 11920 | consumed samples: 8788992 | elapsed time per iteration 
(ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831683E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:40.896369 | finish at 2025-09-10 11:51:06 + [2025-09-10 06:38:31] iteration 8584/ 11920 | consumed samples: 8790016 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842323E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:32.178726 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:38:36] iteration 8585/ 11920 | consumed samples: 8791040 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851933E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:52.113713 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:38:42] iteration 8586/ 11920 | consumed samples: 8792064 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843083E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:23.138251 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:38:48] iteration 8587/ 11920 | consumed samples: 8793088 | elapsed time per iteration (ms): 5936.9 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841554E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:29:47.563403 | finish at 2025-09-10 12:08:35 + [2025-09-10 06:38:53] iteration 8588/ 11920 | consumed samples: 8794112 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842319E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:22.678742 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:38:59] iteration 8589/ 11920 | consumed samples: 8795136 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830348E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:13.521210 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:39:05] iteration 8590/ 11920 | consumed samples: 8796160 | elapsed time per iteration (ms): 6175.3 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840960E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:42:43.663659 | finish at 2025-09-10 12:21:49 + [2025-09-10 06:39:11] iteration 8591/ 11920 | consumed samples: 8797184 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820209E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:48.449430 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:39:16] iteration 8592/ 11920 | consumed samples: 8798208 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.839179E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:55.190857 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:39:22] iteration 8593/ 11920 | consumed samples: 8799232 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837665E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:50.577069 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:39:28] iteration 8594/ 11920 | consumed samples: 8800256 | elapsed time per iteration (ms): 5928.2 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838162E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:28:37.335165 | finish at 2025-09-10 12:08:05 + [2025-09-10 06:39:34] iteration 8595/ 11920 | consumed samples: 8801280 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821020E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:31.275889 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:39:40] iteration 8596/ 11920 | consumed samples: 8802304 | elapsed time per iteration (ms): 5993.7 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839268E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:32:03.073949 | finish at 2025-09-10 12:11:43 + [2025-09-10 06:39:45] iteration 8597/ 11920 | 
consumed samples: 8803328 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825662E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:13.679051 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:39:51] iteration 8598/ 11920 | consumed samples: 8804352 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844397E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:06.958610 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:39:57] iteration 8599/ 11920 | consumed samples: 8805376 | elapsed time per iteration (ms): 5913.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838819E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:18.661682 | finish at 2025-09-10 12:07:15 + [2025-09-10 06:40:02] iteration 8600/ 11920 | consumed samples: 8806400 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835990E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:01.746292 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:40:08] iteration 8601/ 11920 | consumed samples: 8807424 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832559E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:51.966944 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:40:14] iteration 8602/ 11920 | consumed samples: 8808448 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834174E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:58.494903 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:40:19] iteration 8603/ 11920 | consumed samples: 8809472 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833692E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:26.077043 | finish at 2025-09-10 11:51:45 + [2025-09-10 06:40:25] iteration 8604/ 11920 | consumed samples: 8810496 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828910E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:27.219111 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:40:31] iteration 8605/ 11920 | consumed samples: 8811520 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830839E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:30.482193 | finish at 2025-09-10 11:51:01 + [2025-09-10 06:40:36] iteration 8606/ 11920 | consumed samples: 8812544 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 
80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841462E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:31.406765 | finish at 2025-09-10 11:52:08 + [2025-09-10 06:40:42] iteration 8607/ 11920 | consumed samples: 8813568 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819385E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:43.812904 | finish at 2025-09-10 11:51:26 + [2025-09-10 06:40:47] iteration 8608/ 11920 | consumed samples: 8814592 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828124E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:30.331650 | finish at 2025-09-10 11:51:18 + [2025-09-10 06:40:53] iteration 8609/ 11920 | consumed samples: 8815616 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829514E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:21.068187 | finish at 2025-09-10 11:51:14 + [2025-09-10 06:40:59] iteration 8610/ 11920 | consumed samples: 8816640 | elapsed time per iteration (ms): 5923.0 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830002E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:45.292890 | finish at 2025-09-10 
12:07:44 + [2025-09-10 06:41:05] iteration 8611/ 11920 | consumed samples: 8817664 | elapsed time per iteration (ms): 5613.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833818E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:35.307781 | finish at 2025-09-10 11:50:40 + [2025-09-10 06:41:10] iteration 8612/ 11920 | consumed samples: 8818688 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837502E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:41.246132 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:41:16] iteration 8613/ 11920 | consumed samples: 8819712 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834139E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:59.940143 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:41:22] iteration 8614/ 11920 | consumed samples: 8820736 | elapsed time per iteration (ms): 5899.1 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826495E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:25:02.353877 | finish at 2025-09-10 12:06:24 + [2025-09-10 06:41:27] iteration 8615/ 11920 | consumed samples: 8821760 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831579E+00 | loss scale: 
1.0 | grad norm: 0.155 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:23.591208 | finish at 2025-09-10 11:50:51 + [2025-09-10 06:41:33] iteration 8616/ 11920 | consumed samples: 8822784 | elapsed time per iteration (ms): 5848.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831748E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:02.594866 | finish at 2025-09-10 12:03:36 + [2025-09-10 06:41:39] iteration 8617/ 11920 | consumed samples: 8823808 | elapsed time per iteration (ms): 5952.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821972E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:27:39.490329 | finish at 2025-09-10 12:09:19 + [2025-09-10 06:41:45] iteration 8618/ 11920 | consumed samples: 8824832 | elapsed time per iteration (ms): 5837.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839776E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:16.601479 | finish at 2025-09-10 12:03:02 + [2025-09-10 06:41:51] iteration 8619/ 11920 | consumed samples: 8825856 | elapsed time per iteration (ms): 5613.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824761E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:48.779523 | finish at 2025-09-10 11:50:39 + [2025-09-10 06:41:56] iteration 8620/ 11920 | consumed samples: 8826880 | elapsed time per iteration 
(ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846802E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:07.965789 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:42:02] iteration 8621/ 11920 | consumed samples: 8827904 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836876E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:02.313731 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:42:07] iteration 8622/ 11920 | consumed samples: 8828928 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825422E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:27.295329 | finish at 2025-09-10 11:51:35 + [2025-09-10 06:42:13] iteration 8623/ 11920 | consumed samples: 8829952 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830785E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:50.530960 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:42:19] iteration 8624/ 11920 | consumed samples: 8830976 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835344E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:09:29.753792 | finish at 2025-09-10 11:51:48 + [2025-09-10 06:42:25] iteration 8625/ 11920 | consumed samples: 8832000 | elapsed time per iteration (ms): 5819.9 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835778E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:36.662532 | finish at 2025-09-10 12:02:01 + [2025-09-10 06:42:30] iteration 8626/ 11920 | consumed samples: 8833024 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838146E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:24.059356 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:42:36] iteration 8627/ 11920 | consumed samples: 8834048 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838681E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:36.694144 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:42:41] iteration 8628/ 11920 | consumed samples: 8835072 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830776E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:31.695072 | finish at 2025-09-10 11:51:13 + [2025-09-10 06:42:47] iteration 8629/ 11920 | consumed samples: 8836096 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.839780E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:20.565263 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:42:53] iteration 8630/ 11920 | consumed samples: 8837120 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835456E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:41.737921 | finish at 2025-09-10 11:51:34 + [2025-09-10 06:42:58] iteration 8631/ 11920 | consumed samples: 8838144 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837408E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:11.702058 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:43:04] iteration 8632/ 11920 | consumed samples: 8839168 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844366E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:01.454641 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:43:10] iteration 8633/ 11920 | consumed samples: 8840192 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841735E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:09.345220 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:43:15] iteration 8634/ 11920 | 
consumed samples: 8841216 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820576E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:14.905451 | finish at 2025-09-10 11:51:30 + [2025-09-10 06:43:21] iteration 8635/ 11920 | consumed samples: 8842240 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836919E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:55.206778 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:43:26] iteration 8636/ 11920 | consumed samples: 8843264 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827285E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:42.749717 | finish at 2025-09-10 11:51:09 + [2025-09-10 06:43:32] iteration 8637/ 11920 | consumed samples: 8844288 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840086E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:35.775914 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:43:38] iteration 8638/ 11920 | consumed samples: 8845312 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824878E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:57.109503 | finish at 2025-09-10 11:51:35 + [2025-09-10 06:43:44] iteration 8639/ 11920 | consumed samples: 8846336 | elapsed time per iteration (ms): 5965.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824318E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:13.618641 | finish at 2025-09-10 12:09:57 + [2025-09-10 06:43:49] iteration 8640/ 11920 | consumed samples: 8847360 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827632E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:38.539581 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:43:55] iteration 8641/ 11920 | consumed samples: 8848384 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845468E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:25.875225 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:44:01] iteration 8642/ 11920 | consumed samples: 8849408 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831455E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:53.801805 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:44:06] iteration 8643/ 11920 | consumed samples: 8850432 | elapsed time per iteration (ms): 5869.9 | throughput per GPU (TFLOP/s/GPU): 
76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836336E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:35.591163 | finish at 2025-09-10 12:04:42 + [2025-09-10 06:44:12] iteration 8644/ 11920 | consumed samples: 8851456 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840643E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:42.082769 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:44:18] iteration 8645/ 11920 | consumed samples: 8852480 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842873E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:02.883040 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:44:23] iteration 8646/ 11920 | consumed samples: 8853504 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827092E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:56.829192 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:44:29] iteration 8647/ 11920 | consumed samples: 8854528 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836910E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:57.387461 | finish at 2025-09-10 
11:51:26 + [2025-09-10 06:44:35] iteration 8648/ 11920 | consumed samples: 8855552 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837280E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:05.278065 | finish at 2025-09-10 11:51:40 + [2025-09-10 06:44:40] iteration 8649/ 11920 | consumed samples: 8856576 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846050E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:29.296777 | finish at 2025-09-10 11:51:09 + [2025-09-10 06:44:46] iteration 8650/ 11920 | consumed samples: 8857600 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828776E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:54.493582 | finish at 2025-09-10 11:51:40 + [2025-09-10 06:44:51] iteration 8651/ 11920 | consumed samples: 8858624 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828820E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:16.288399 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:44:57] iteration 8652/ 11920 | consumed samples: 8859648 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842183E+00 | loss scale: 
1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:25.180283 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:45:03] iteration 8653/ 11920 | consumed samples: 8860672 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824020E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:09.427808 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:45:09] iteration 8654/ 11920 | consumed samples: 8861696 | elapsed time per iteration (ms): 5950.2 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825586E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:53.195393 | finish at 2025-09-10 12:09:02 + [2025-09-10 06:45:14] iteration 8655/ 11920 | consumed samples: 8862720 | elapsed time per iteration (ms): 5905.5 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840193E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:21:21.324176 | finish at 2025-09-10 12:06:36 + [2025-09-10 06:45:20] iteration 8656/ 11920 | consumed samples: 8863744 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832884E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:57.376694 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:45:26] iteration 8657/ 11920 | consumed samples: 8864768 | elapsed time per iteration 
(ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831774E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:44.388330 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:45:31] iteration 8658/ 11920 | consumed samples: 8865792 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830083E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:36.172689 | finish at 2025-09-10 11:51:08 + [2025-09-10 06:45:37] iteration 8659/ 11920 | consumed samples: 8866816 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838304E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:48.864379 | finish at 2025-09-10 11:51:26 + [2025-09-10 06:45:43] iteration 8660/ 11920 | consumed samples: 8867840 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825338E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:47.032905 | finish at 2025-09-10 11:51:30 + [2025-09-10 06:45:48] iteration 8661/ 11920 | consumed samples: 8868864 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833283E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:05:47.314889 | finish at 2025-09-10 11:51:36 + [2025-09-10 06:45:54] iteration 8662/ 11920 | consumed samples: 8869888 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836092E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:11.872806 | finish at 2025-09-10 11:51:06 + [2025-09-10 06:46:00] iteration 8663/ 11920 | consumed samples: 8870912 | elapsed time per iteration (ms): 5954.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838684E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:12.501024 | finish at 2025-09-10 12:09:12 + [2025-09-10 06:46:05] iteration 8664/ 11920 | consumed samples: 8871936 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827761E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:16.124846 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:46:11] iteration 8665/ 11920 | consumed samples: 8872960 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828330E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:53.300625 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:46:17] iteration 8666/ 11920 | consumed samples: 8873984 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.832778E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:42.099357 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:46:23] iteration 8667/ 11920 | consumed samples: 8875008 | elapsed time per iteration (ms): 5961.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827446E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:23:12.355971 | finish at 2025-09-10 12:09:35 + [2025-09-10 06:46:28] iteration 8668/ 11920 | consumed samples: 8876032 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823521E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:40.530344 | finish at 2025-09-10 11:51:09 + [2025-09-10 06:46:34] iteration 8669/ 11920 | consumed samples: 8877056 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836100E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:44.615587 | finish at 2025-09-10 11:51:18 + [2025-09-10 06:46:40] iteration 8670/ 11920 | consumed samples: 8878080 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831095E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:48.701057 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:46:45] iteration 8671/ 11920 | 
consumed samples: 8879104 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842667E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:10.732523 | finish at 2025-09-10 11:50:56 + [2025-09-10 06:46:51] iteration 8672/ 11920 | consumed samples: 8880128 | elapsed time per iteration (ms): 5882.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838869E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:27.266586 | finish at 2025-09-10 12:05:18 + [2025-09-10 06:46:57] iteration 8673/ 11920 | consumed samples: 8881152 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829193E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:27.956969 | finish at 2025-09-10 11:51:25 + [2025-09-10 06:47:03] iteration 8674/ 11920 | consumed samples: 8882176 | elapsed time per iteration (ms): 6026.1 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836191E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:26:00.799280 | finish at 2025-09-10 12:13:03 + [2025-09-10 06:47:09] iteration 8675/ 11920 | consumed samples: 8883200 | elapsed time per iteration (ms): 5856.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830084E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:44.416481 | finish at 2025-09-10 12:03:53 + [2025-09-10 06:47:14] iteration 8676/ 11920 | consumed samples: 8884224 | elapsed time per iteration (ms): 5835.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836123E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:31.119194 | finish at 2025-09-10 12:02:45 + [2025-09-10 06:47:20] iteration 8677/ 11920 | consumed samples: 8885248 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836914E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:04:16.713317 | finish at 2025-09-10 11:51:37 + [2025-09-10 06:47:26] iteration 8678/ 11920 | consumed samples: 8886272 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839409E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:44.074444 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:47:31] iteration 8679/ 11920 | consumed samples: 8887296 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834583E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:53.042051 | finish at 2025-09-10 11:51:24 + [2025-09-10 06:47:37] iteration 8680/ 11920 | consumed samples: 8888320 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826888E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:57.042131 | finish at 2025-09-10 11:51:34 + [2025-09-10 06:47:42] iteration 8681/ 11920 | consumed samples: 8889344 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826111E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:10.326504 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:47:48] iteration 8682/ 11920 | consumed samples: 8890368 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835584E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:18.763948 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:47:54] iteration 8683/ 11920 | consumed samples: 8891392 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841014E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:07.718869 | finish at 2025-09-10 11:51:01 + [2025-09-10 06:47:59] iteration 8684/ 11920 | consumed samples: 8892416 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833079E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:19.643052 | finish at 2025-09-10 
11:51:19 + [2025-09-10 06:48:05] iteration 8685/ 11920 | consumed samples: 8893440 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827370E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:51.747335 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:48:11] iteration 8686/ 11920 | consumed samples: 8894464 | elapsed time per iteration (ms): 5978.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839977E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:22:13.313616 | finish at 2025-09-10 12:10:24 + [2025-09-10 06:48:17] iteration 8687/ 11920 | consumed samples: 8895488 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831077E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:01.709299 | finish at 2025-09-10 11:51:18 + [2025-09-10 06:48:22] iteration 8688/ 11920 | consumed samples: 8896512 | elapsed time per iteration (ms): 5843.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850788E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:45.245415 | finish at 2025-09-10 12:03:08 + [2025-09-10 06:48:28] iteration 8689/ 11920 | consumed samples: 8897536 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818844E+00 | loss scale: 
1.0 | grad norm: 0.118 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:33.073053 | finish at 2025-09-10 11:51:01 + [2025-09-10 06:48:34] iteration 8690/ 11920 | consumed samples: 8898560 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831190E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:37.937138 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:48:39] iteration 8691/ 11920 | consumed samples: 8899584 | elapsed time per iteration (ms): 5836.0 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832459E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:14:04.328411 | finish at 2025-09-10 12:02:44 + [2025-09-10 06:48:45] iteration 8692/ 11920 | consumed samples: 8900608 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827866E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:17.814013 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:48:51] iteration 8693/ 11920 | consumed samples: 8901632 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827857E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:09.023741 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:48:57] iteration 8694/ 11920 | consumed samples: 8902656 | elapsed time per iteration 
(ms): 5943.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852423E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:19:34.688713 | finish at 2025-09-10 12:08:31 + [2025-09-10 06:49:02] iteration 8695/ 11920 | consumed samples: 8903680 | elapsed time per iteration (ms): 5615.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833438E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:50.747856 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:49:08] iteration 8696/ 11920 | consumed samples: 8904704 | elapsed time per iteration (ms): 5615.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848670E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:45.003754 | finish at 2025-09-10 11:50:53 + [2025-09-10 06:49:14] iteration 8697/ 11920 | consumed samples: 8905728 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824493E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:47.083815 | finish at 2025-09-10 11:51:01 + [2025-09-10 06:49:19] iteration 8698/ 11920 | consumed samples: 8906752 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830935E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 5:02:13.724110 | finish at 2025-09-10 11:51:33 + [2025-09-10 06:49:25] iteration 8699/ 11920 | consumed samples: 8907776 | elapsed time per iteration (ms): 5974.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843648E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:20:44.183775 | finish at 2025-09-10 12:10:09 + [2025-09-10 06:49:31] iteration 8700/ 11920 | consumed samples: 8908800 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828422E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:23.304839 | finish at 2025-09-10 11:50:54 + [2025-09-10 06:49:36] iteration 8701/ 11920 | consumed samples: 8909824 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833362E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:30.734351 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:49:42] iteration 8702/ 11920 | consumed samples: 8910848 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836426E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:23.148719 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:49:48] iteration 8703/ 11920 | consumed samples: 8911872 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.836011E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:32.823176 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:49:53] iteration 8704/ 11920 | consumed samples: 8912896 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824864E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:36.036655 | finish at 2025-09-10 11:51:29 + [2025-09-10 06:49:59] iteration 8705/ 11920 | consumed samples: 8913920 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834525E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:29.500691 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:50:05] iteration 8706/ 11920 | consumed samples: 8914944 | elapsed time per iteration (ms): 5936.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832647E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:58.642278 | finish at 2025-09-10 12:08:03 + [2025-09-10 06:50:11] iteration 8707/ 11920 | consumed samples: 8915968 | elapsed time per iteration (ms): 5947.8 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837283E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:18:30.243610 | finish at 2025-09-10 12:08:41 + [2025-09-10 06:50:17] iteration 8708/ 11920 | 
consumed samples: 8916992 | elapsed time per iteration (ms): 5859.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833094E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:40.486337 | finish at 2025-09-10 12:03:57 + [2025-09-10 06:50:22] iteration 8709/ 11920 | consumed samples: 8918016 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833844E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:35.421769 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:50:28] iteration 8710/ 11920 | consumed samples: 8919040 | elapsed time per iteration (ms): 5920.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836441E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:43.259940 | finish at 2025-09-10 12:07:11 + [2025-09-10 06:50:34] iteration 8711/ 11920 | consumed samples: 8920064 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839950E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:45.037583 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:50:40] iteration 8712/ 11920 | consumed samples: 8921088 | elapsed time per iteration (ms): 5833.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839850E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:52.845106 | finish at 2025-09-10 12:02:32 + [2025-09-10 06:50:45] iteration 8713/ 11920 | consumed samples: 8922112 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843848E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:11.602130 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:50:51] iteration 8714/ 11920 | consumed samples: 8923136 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827482E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:36.900733 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:50:57] iteration 8715/ 11920 | consumed samples: 8924160 | elapsed time per iteration (ms): 6257.9 | throughput per GPU (TFLOP/s/GPU): 72.1 | MFU 7.29% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846090E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:34:16.412088 | finish at 2025-09-10 12:25:13 + [2025-09-10 06:51:03] iteration 8716/ 11920 | consumed samples: 8925184 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832842E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:59.037023 | finish at 2025-09-10 11:51:02 + [2025-09-10 06:51:08] iteration 8717/ 11920 | consumed samples: 8926208 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842723E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:01.503394 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:51:14] iteration 8718/ 11920 | consumed samples: 8927232 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839623E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:07.171830 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:51:20] iteration 8719/ 11920 | consumed samples: 8928256 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837498E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:00:15.618814 | finish at 2025-09-10 11:51:35 + [2025-09-10 06:51:25] iteration 8720/ 11920 | consumed samples: 8929280 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833653E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:31.962738 | finish at 2025-09-10 11:50:57 + [2025-09-10 06:51:31] iteration 8721/ 11920 | consumed samples: 8930304 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835463E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:32.980473 | finish at 2025-09-10 
11:51:04 + [2025-09-10 06:51:36] iteration 8722/ 11920 | consumed samples: 8931328 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841032E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:46.880442 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:51:42] iteration 8723/ 11920 | consumed samples: 8932352 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815776E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:54.351800 | finish at 2025-09-10 11:51:36 + [2025-09-10 06:51:48] iteration 8724/ 11920 | consumed samples: 8933376 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814096E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:45.472657 | finish at 2025-09-10 11:51:33 + [2025-09-10 06:51:54] iteration 8725/ 11920 | consumed samples: 8934400 | elapsed time per iteration (ms): 5876.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829552E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:55.396585 | finish at 2025-09-10 12:04:49 + [2025-09-10 06:51:59] iteration 8726/ 11920 | consumed samples: 8935424 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837360E+00 | loss scale: 
1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:25.696383 | finish at 2025-09-10 11:51:25 + [2025-09-10 06:52:05] iteration 8727/ 11920 | consumed samples: 8936448 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828746E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:14.020217 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:52:10] iteration 8728/ 11920 | consumed samples: 8937472 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838012E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:52.931591 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:52:16] iteration 8729/ 11920 | consumed samples: 8938496 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826340E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:00.596962 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:52:22] iteration 8730/ 11920 | consumed samples: 8939520 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826340E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:23.224018 | finish at 2025-09-10 11:51:45 + [2025-09-10 06:52:27] iteration 8731/ 11920 | consumed samples: 8940544 | elapsed time per iteration 
(ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828464E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:46.955185 | finish at 2025-09-10 11:51:14 + [2025-09-10 06:52:33] iteration 8732/ 11920 | consumed samples: 8941568 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824039E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:33.659936 | finish at 2025-09-10 11:51:07 + [2025-09-10 06:52:39] iteration 8733/ 11920 | consumed samples: 8942592 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827077E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:19.896121 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:52:44] iteration 8734/ 11920 | consumed samples: 8943616 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850640E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:28.216756 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:52:50] iteration 8735/ 11920 | consumed samples: 8944640 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843760E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:58:48.758945 | finish at 2025-09-10 11:51:39 + [2025-09-10 06:52:56] iteration 8736/ 11920 | consumed samples: 8945664 | elapsed time per iteration (ms): 5973.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830364E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:58.288742 | finish at 2025-09-10 12:09:54 + [2025-09-10 06:53:01] iteration 8737/ 11920 | consumed samples: 8946688 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834511E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:19.946139 | finish at 2025-09-10 11:51:21 + [2025-09-10 06:53:07] iteration 8738/ 11920 | consumed samples: 8947712 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841287E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:11.532224 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:53:13] iteration 8739/ 11920 | consumed samples: 8948736 | elapsed time per iteration (ms): 5890.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831643E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:16.408107 | finish at 2025-09-10 12:05:29 + [2025-09-10 06:53:19] iteration 8740/ 11920 | consumed samples: 8949760 | elapsed time per iteration (ms): 5838.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.823575E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:26.338334 | finish at 2025-09-10 12:02:45 + [2025-09-10 06:53:24] iteration 8741/ 11920 | consumed samples: 8950784 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842945E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:10.804963 | finish at 2025-09-10 11:51:35 + [2025-09-10 06:53:30] iteration 8742/ 11920 | consumed samples: 8951808 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835303E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:47.051884 | finish at 2025-09-10 11:52:17 + [2025-09-10 06:53:36] iteration 8743/ 11920 | consumed samples: 8952832 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832236E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:55.837811 | finish at 2025-09-10 11:51:31 + [2025-09-10 06:53:41] iteration 8744/ 11920 | consumed samples: 8953856 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843452E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:28.900801 | finish at 2025-09-10 11:51:10 + [2025-09-10 06:53:47] iteration 8745/ 11920 | 
consumed samples: 8954880 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825531E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:45.214336 | finish at 2025-09-10 11:51:32 + [2025-09-10 06:53:53] iteration 8746/ 11920 | consumed samples: 8955904 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838808E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:27.380515 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:53:58] iteration 8747/ 11920 | consumed samples: 8956928 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825057E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:59.497144 | finish at 2025-09-10 11:51:58 + [2025-09-10 06:54:04] iteration 8748/ 11920 | consumed samples: 8957952 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834274E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:21.530470 | finish at 2025-09-10 11:51:25 + [2025-09-10 06:54:09] iteration 8749/ 11920 | consumed samples: 8958976 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817041E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:09.534750 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:54:15] iteration 8750/ 11920 | consumed samples: 8960000 | elapsed time per iteration (ms): 5614.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817765E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:36.666703 | finish at 2025-09-10 11:50:52 + [2025-09-10 06:54:21] iteration 8751/ 11920 | consumed samples: 8961024 | elapsed time per iteration (ms): 5970.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821791E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:15:19.562485 | finish at 2025-09-10 12:09:41 + [2025-09-10 06:54:27] iteration 8752/ 11920 | consumed samples: 8962048 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818316E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:32.240089 | finish at 2025-09-10 11:50:59 + [2025-09-10 06:54:32] iteration 8753/ 11920 | consumed samples: 8963072 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821201E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:18.229978 | finish at 2025-09-10 11:51:50 + [2025-09-10 06:54:38] iteration 8754/ 11920 | consumed samples: 8964096 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842394E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:33.562756 | finish at 2025-09-10 11:51:11 + [2025-09-10 06:54:43] iteration 8755/ 11920 | consumed samples: 8965120 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830327E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:17.531408 | finish at 2025-09-10 11:51:01 + [2025-09-10 06:54:49] iteration 8756/ 11920 | consumed samples: 8966144 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829948E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:40.237559 | finish at 2025-09-10 11:51:29 + [2025-09-10 06:54:55] iteration 8757/ 11920 | consumed samples: 8967168 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818696E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:07.581732 | finish at 2025-09-10 11:52:02 + [2025-09-10 06:55:00] iteration 8758/ 11920 | consumed samples: 8968192 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814427E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:08.946203 | finish at 2025-09-10 
11:51:09 + [2025-09-10 06:55:06] iteration 8759/ 11920 | consumed samples: 8969216 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811243E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:16.506351 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:55:12] iteration 8760/ 11920 | consumed samples: 8970240 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829690E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:05.080700 | finish at 2025-09-10 11:51:17 + [2025-09-10 06:55:18] iteration 8761/ 11920 | consumed samples: 8971264 | elapsed time per iteration (ms): 5933.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821458E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:24.527688 | finish at 2025-09-10 12:07:42 + [2025-09-10 06:55:23] iteration 8762/ 11920 | consumed samples: 8972288 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814672E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:01.720112 | finish at 2025-09-10 11:51:25 + [2025-09-10 06:55:29] iteration 8763/ 11920 | consumed samples: 8973312 | elapsed time per iteration (ms): 5931.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815974E+00 | loss scale: 
1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:06.260206 | finish at 2025-09-10 12:07:35 + [2025-09-10 06:55:35] iteration 8764/ 11920 | consumed samples: 8974336 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842618E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:52.332204 | finish at 2025-09-10 11:51:27 + [2025-09-10 06:55:40] iteration 8765/ 11920 | consumed samples: 8975360 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820974E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:24.702168 | finish at 2025-09-10 11:52:05 + [2025-09-10 06:55:46] iteration 8766/ 11920 | consumed samples: 8976384 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841962E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:18.156182 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:55:52] iteration 8767/ 11920 | consumed samples: 8977408 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828688E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:56.571310 | finish at 2025-09-10 11:51:48 + [2025-09-10 06:55:57] iteration 8768/ 11920 | consumed samples: 8978432 | elapsed time per iteration 
(ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822620E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:51.116268 | finish at 2025-09-10 11:51:48 + [2025-09-10 06:56:03] iteration 8769/ 11920 | consumed samples: 8979456 | elapsed time per iteration (ms): 5973.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813825E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:43.131852 | finish at 2025-09-10 12:09:46 + [2025-09-10 06:56:09] iteration 8770/ 11920 | consumed samples: 8980480 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824466E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:07.596624 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:56:14] iteration 8771/ 11920 | consumed samples: 8981504 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813819E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:50.699949 | finish at 2025-09-10 11:51:05 + [2025-09-10 06:56:20] iteration 8772/ 11920 | consumed samples: 8982528 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830132E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:54:58.551291 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:56:26] iteration 8773/ 11920 | consumed samples: 8983552 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817425E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:58.939814 | finish at 2025-09-10 11:51:25 + [2025-09-10 06:56:31] iteration 8774/ 11920 | consumed samples: 8984576 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826040E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:09.994938 | finish at 2025-09-10 11:51:41 + [2025-09-10 06:56:37] iteration 8775/ 11920 | consumed samples: 8985600 | elapsed time per iteration (ms): 5951.2 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831787E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:56.569016 | finish at 2025-09-10 12:08:34 + [2025-09-10 06:56:43] iteration 8776/ 11920 | consumed samples: 8986624 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817693E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:38.866121 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:56:49] iteration 8777/ 11920 | consumed samples: 8987648 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.840463E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:37.840332 | finish at 2025-09-10 11:51:26 + [2025-09-10 06:56:54] iteration 8778/ 11920 | consumed samples: 8988672 | elapsed time per iteration (ms): 5839.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825425E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:47.971033 | finish at 2025-09-10 12:02:42 + [2025-09-10 06:57:00] iteration 8779/ 11920 | consumed samples: 8989696 | elapsed time per iteration (ms): 5874.9 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832737E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:33.075451 | finish at 2025-09-10 12:04:33 + [2025-09-10 06:57:06] iteration 8780/ 11920 | consumed samples: 8990720 | elapsed time per iteration (ms): 5837.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815280E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:29.336305 | finish at 2025-09-10 12:02:35 + [2025-09-10 06:57:12] iteration 8781/ 11920 | consumed samples: 8991744 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821276E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:11.052480 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:57:17] iteration 8782/ 11920 | 
consumed samples: 8992768 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823921E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:46.422393 | finish at 2025-09-10 11:51:04 + [2025-09-10 06:57:23] iteration 8783/ 11920 | consumed samples: 8993792 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822755E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:55.721803 | finish at 2025-09-10 11:51:19 + [2025-09-10 06:57:29] iteration 8784/ 11920 | consumed samples: 8994816 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820114E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:34.177353 | finish at 2025-09-10 11:51:03 + [2025-09-10 06:57:35] iteration 8785/ 11920 | consumed samples: 8995840 | elapsed time per iteration (ms): 5955.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830173E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:11:10.429362 | finish at 2025-09-10 12:08:45 + [2025-09-10 06:57:40] iteration 8786/ 11920 | consumed samples: 8996864 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819736E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:46.346245 | finish at 2025-09-10 11:51:26 + [2025-09-10 06:57:46] iteration 8787/ 11920 | consumed samples: 8997888 | elapsed time per iteration (ms): 6068.2 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831600E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:16:51.780753 | finish at 2025-09-10 12:14:38 + [2025-09-10 06:57:52] iteration 8788/ 11920 | consumed samples: 8998912 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834213E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:27.729077 | finish at 2025-09-10 11:51:20 + [2025-09-10 06:57:57] iteration 8789/ 11920 | consumed samples: 8999936 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818868E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:00.613317 | finish at 2025-09-10 11:51:58 + [2025-09-10 06:58:03] iteration 8790/ 11920 | consumed samples: 9000960 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821497E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:19.296441 | finish at 2025-09-10 11:51:22 + [2025-09-10 06:58:09] iteration 8791/ 11920 | consumed samples: 9001984 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 
80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813539E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:51.792392 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:58:14] iteration 8792/ 11920 | consumed samples: 9003008 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814463E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:57.987419 | finish at 2025-09-10 11:51:12 + [2025-09-10 06:58:20] iteration 8793/ 11920 | consumed samples: 9004032 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821997E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:55.654176 | finish at 2025-09-10 11:51:16 + [2025-09-10 06:58:26] iteration 8794/ 11920 | consumed samples: 9005056 | elapsed time per iteration (ms): 6095.1 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831309E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:17:33.385220 | finish at 2025-09-10 12:15:59 + [2025-09-10 06:58:32] iteration 8795/ 11920 | consumed samples: 9006080 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820855E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:01.930757 | finish at 2025-09-10 
11:51:34 + [2025-09-10 06:58:37] iteration 8796/ 11920 | consumed samples: 9007104 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826167E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:38.068375 | finish at 2025-09-10 11:51:15 + [2025-09-10 06:58:43] iteration 8797/ 11920 | consumed samples: 9008128 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835275E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:45.373924 | finish at 2025-09-10 11:51:28 + [2025-09-10 06:58:49] iteration 8798/ 11920 | consumed samples: 9009152 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816685E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:48.892168 | finish at 2025-09-10 11:51:37 + [2025-09-10 06:58:54] iteration 8799/ 11920 | consumed samples: 9010176 | elapsed time per iteration (ms): 5643.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825926E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:31.673175 | finish at 2025-09-10 11:52:26 + [2025-09-10 06:59:00] iteration 8800/ 11920 | consumed samples: 9011200 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825673E+00 | loss scale: 
1.0 | grad norm: 0.205 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:52.071991 | finish at 2025-09-10 11:51:52 + [2025-09-10 06:59:05] iteration 8801/ 11920 | consumed samples: 9012224 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821166E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:17.410184 | finish at 2025-09-10 11:51:23 + [2025-09-10 06:59:11] iteration 8802/ 11920 | consumed samples: 9013248 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834268E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:57.928414 | finish at 2025-09-10 11:51:09 + [2025-09-10 06:59:17] iteration 8803/ 11920 | consumed samples: 9014272 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838823E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:41.778903 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:59:22] iteration 8804/ 11920 | consumed samples: 9015296 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826445E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:35.829649 | finish at 2025-09-10 11:50:58 + [2025-09-10 06:59:28] iteration 8805/ 11920 | consumed samples: 9016320 | elapsed time per iteration 
(ms): 5917.1 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819514E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:07:11.780463 | finish at 2025-09-10 12:06:40 + [2025-09-10 06:59:34] iteration 8806/ 11920 | consumed samples: 9017344 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826897E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:52.692992 | finish at 2025-09-10 11:51:27 + [2025-09-10 06:59:39] iteration 8807/ 11920 | consumed samples: 9018368 | elapsed time per iteration (ms): 5615.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819147E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:20.325548 | finish at 2025-09-10 11:51:00 + [2025-09-10 06:59:45] iteration 8808/ 11920 | consumed samples: 9019392 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819352E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:20.337263 | finish at 2025-09-10 11:52:05 + [2025-09-10 06:59:51] iteration 8809/ 11920 | consumed samples: 9020416 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828947E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:51:52.241620 | finish at 2025-09-10 11:51:43 + [2025-09-10 06:59:56] iteration 8810/ 11920 | consumed samples: 9021440 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821868E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:11.402445 | finish at 2025-09-10 11:52:08 + [2025-09-10 07:00:02] iteration 8811/ 11920 | consumed samples: 9022464 | elapsed time per iteration (ms): 5998.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821045E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:10:50.588798 | finish at 2025-09-10 12:10:53 + [2025-09-10 07:00:08] iteration 8812/ 11920 | consumed samples: 9023488 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822755E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:04.483947 | finish at 2025-09-10 11:51:12 + [2025-09-10 07:00:14] iteration 8813/ 11920 | consumed samples: 9024512 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833366E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:59.855148 | finish at 2025-09-10 11:51:13 + [2025-09-10 07:00:19] iteration 8814/ 11920 | consumed samples: 9025536 | elapsed time per iteration (ms): 5919.6 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.816126E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:06:26.395375 | finish at 2025-09-10 12:06:46 + [2025-09-10 07:00:25] iteration 8815/ 11920 | consumed samples: 9026560 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825627E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:42.476882 | finish at 2025-09-10 11:51:08 + [2025-09-10 07:00:31] iteration 8816/ 11920 | consumed samples: 9027584 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835537E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:10.288193 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:00:36] iteration 8817/ 11920 | consumed samples: 9028608 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822674E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:40.215722 | finish at 2025-09-10 11:51:17 + [2025-09-10 07:00:42] iteration 8818/ 11920 | consumed samples: 9029632 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836751E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:27.749044 | finish at 2025-09-10 11:51:10 + [2025-09-10 07:00:48] iteration 8819/ 11920 | 
consumed samples: 9030656 | elapsed time per iteration (ms): 5989.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824564E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:32.350600 | finish at 2025-09-10 12:10:20 + [2025-09-10 07:00:54] iteration 8820/ 11920 | consumed samples: 9031680 | elapsed time per iteration (ms): 5967.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830997E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:20.265431 | finish at 2025-09-10 12:09:14 + [2025-09-10 07:01:00] iteration 8821/ 11920 | consumed samples: 9032704 | elapsed time per iteration (ms): 5876.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830357E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:31.214793 | finish at 2025-09-10 12:04:31 + [2025-09-10 07:01:05] iteration 8822/ 11920 | consumed samples: 9033728 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828647E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:17.088887 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:01:11] iteration 8823/ 11920 | consumed samples: 9034752 | elapsed time per iteration (ms): 5615.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831466E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:51.690750 | finish at 2025-09-10 11:51:03 + [2025-09-10 07:01:17] iteration 8824/ 11920 | consumed samples: 9035776 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813127E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:57.447676 | finish at 2025-09-10 11:51:14 + [2025-09-10 07:01:22] iteration 8825/ 11920 | consumed samples: 9036800 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826125E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:20.307808 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:01:28] iteration 8826/ 11920 | consumed samples: 9037824 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825396E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:58.559774 | finish at 2025-09-10 11:51:26 + [2025-09-10 07:01:34] iteration 8827/ 11920 | consumed samples: 9038848 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822813E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:21.241914 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:01:40] iteration 8828/ 11920 | consumed samples: 9039872 | elapsed time per iteration (ms): 5979.7 | throughput per GPU (TFLOP/s/GPU): 
75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826814E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:09.356522 | finish at 2025-09-10 12:09:49 + [2025-09-10 07:01:45] iteration 8829/ 11920 | consumed samples: 9040896 | elapsed time per iteration (ms): 5820.1 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815107E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:50.014800 | finish at 2025-09-10 12:01:35 + [2025-09-10 07:01:51] iteration 8830/ 11920 | consumed samples: 9041920 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827204E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:51.618505 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:01:57] iteration 8831/ 11920 | consumed samples: 9042944 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823700E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:44.119503 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:02:02] iteration 8832/ 11920 | consumed samples: 9043968 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823468E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:19.123577 | finish at 2025-09-10 
11:51:21 + [2025-09-10 07:02:08] iteration 8833/ 11920 | consumed samples: 9044992 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826857E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:21.336064 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:02:14] iteration 8834/ 11920 | consumed samples: 9046016 | elapsed time per iteration (ms): 5988.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836067E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:08:01.700353 | finish at 2025-09-10 12:10:16 + [2025-09-10 07:02:19] iteration 8835/ 11920 | consumed samples: 9047040 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838777E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:57.396612 | finish at 2025-09-10 11:51:17 + [2025-09-10 07:02:25] iteration 8836/ 11920 | consumed samples: 9048064 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817863E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:28.817322 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:02:31] iteration 8837/ 11920 | consumed samples: 9049088 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828105E+00 | loss scale: 
1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:23.839599 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:02:37] iteration 8838/ 11920 | consumed samples: 9050112 | elapsed time per iteration (ms): 5940.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839951E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:09.649793 | finish at 2025-09-10 12:07:46 + [2025-09-10 07:02:42] iteration 8839/ 11920 | consumed samples: 9051136 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825058E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:39.674800 | finish at 2025-09-10 11:51:22 + [2025-09-10 07:02:48] iteration 8840/ 11920 | consumed samples: 9052160 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818009E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:36.836462 | finish at 2025-09-10 11:51:25 + [2025-09-10 07:02:54] iteration 8841/ 11920 | consumed samples: 9053184 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818356E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:15.699281 | finish at 2025-09-10 11:52:09 + [2025-09-10 07:02:59] iteration 8842/ 11920 | consumed samples: 9054208 | elapsed time per iteration 
(ms): 5645.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820927E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:36.710132 | finish at 2025-09-10 11:52:36 + [2025-09-10 07:03:05] iteration 8843/ 11920 | consumed samples: 9055232 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825401E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:01.731122 | finish at 2025-09-10 11:52:07 + [2025-09-10 07:03:10] iteration 8844/ 11920 | consumed samples: 9056256 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819777E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:09.760533 | finish at 2025-09-10 11:51:20 + [2025-09-10 07:03:16] iteration 8845/ 11920 | consumed samples: 9057280 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828001E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:12.726910 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:03:22] iteration 8846/ 11920 | consumed samples: 9058304 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819978E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:47:51.751963 | finish at 2025-09-10 11:51:13 + [2025-09-10 07:03:27] iteration 8847/ 11920 | consumed samples: 9059328 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814422E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:44.226191 | finish at 2025-09-10 11:51:12 + [2025-09-10 07:03:33] iteration 8848/ 11920 | consumed samples: 9060352 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834907E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:29.378906 | finish at 2025-09-10 11:52:02 + [2025-09-10 07:03:39] iteration 8849/ 11920 | consumed samples: 9061376 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843250E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:02.822201 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:03:44] iteration 8850/ 11920 | consumed samples: 9062400 | elapsed time per iteration (ms): 5842.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848280E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:57.149034 | finish at 2025-09-10 12:02:42 + [2025-09-10 07:03:50] iteration 8851/ 11920 | consumed samples: 9063424 | elapsed time per iteration (ms): 5835.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.830279E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:29.525603 | finish at 2025-09-10 12:02:20 + [2025-09-10 07:03:56] iteration 8852/ 11920 | consumed samples: 9064448 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825630E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:33.774632 | finish at 2025-09-10 11:51:30 + [2025-09-10 07:04:02] iteration 8853/ 11920 | consumed samples: 9065472 | elapsed time per iteration (ms): 5973.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843530E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:05:19.841947 | finish at 2025-09-10 12:09:22 + [2025-09-10 07:04:07] iteration 8854/ 11920 | consumed samples: 9066496 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829037E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:02.695265 | finish at 2025-09-10 11:51:10 + [2025-09-10 07:04:13] iteration 8855/ 11920 | consumed samples: 9067520 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833721E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:35.500207 | finish at 2025-09-10 11:51:49 + [2025-09-10 07:04:19] iteration 8856/ 11920 | 
consumed samples: 9068544 | elapsed time per iteration (ms): 5919.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834075E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:15.795271 | finish at 2025-09-10 12:06:35 + [2025-09-10 07:04:25] iteration 8857/ 11920 | consumed samples: 9069568 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822732E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:45.032706 | finish at 2025-09-10 11:51:10 + [2025-09-10 07:04:30] iteration 8858/ 11920 | consumed samples: 9070592 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843562E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:15.319637 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:04:36] iteration 8859/ 11920 | consumed samples: 9071616 | elapsed time per iteration (ms): 5640.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825083E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:46.102734 | finish at 2025-09-10 11:52:22 + [2025-09-10 07:04:42] iteration 8860/ 11920 | consumed samples: 9072640 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824629E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:44.131207 | finish at 2025-09-10 11:51:26 + [2025-09-10 07:04:47] iteration 8861/ 11920 | consumed samples: 9073664 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818648E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:35.194172 | finish at 2025-09-10 11:51:22 + [2025-09-10 07:04:53] iteration 8862/ 11920 | consumed samples: 9074688 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816394E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:43.656706 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:04:58] iteration 8863/ 11920 | consumed samples: 9075712 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823752E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:23.551673 | finish at 2025-09-10 11:51:22 + [2025-09-10 07:05:04] iteration 8864/ 11920 | consumed samples: 9076736 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822195E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:13.587395 | finish at 2025-09-10 11:51:18 + [2025-09-10 07:05:10] iteration 8865/ 11920 | consumed samples: 9077760 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 
80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827932E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:05.732402 | finish at 2025-09-10 11:51:15 + [2025-09-10 07:05:15] iteration 8866/ 11920 | consumed samples: 9078784 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824730E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:56.180872 | finish at 2025-09-10 11:51:11 + [2025-09-10 07:05:21] iteration 8867/ 11920 | consumed samples: 9079808 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819984E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:52.303652 | finish at 2025-09-10 11:51:13 + [2025-09-10 07:05:27] iteration 8868/ 11920 | consumed samples: 9080832 | elapsed time per iteration (ms): 5935.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821703E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:54.177208 | finish at 2025-09-10 12:07:21 + [2025-09-10 07:05:32] iteration 8869/ 11920 | consumed samples: 9081856 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818812E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:37.083345 | finish at 2025-09-10 
11:52:10 + [2025-09-10 07:05:38] iteration 8870/ 11920 | consumed samples: 9082880 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829549E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:38.531613 | finish at 2025-09-10 11:51:17 + [2025-09-10 07:05:44] iteration 8871/ 11920 | consumed samples: 9083904 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824073E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:48.725510 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:05:50] iteration 8872/ 11920 | consumed samples: 9084928 | elapsed time per iteration (ms): 5927.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824633E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:08.253239 | finish at 2025-09-10 12:06:58 + [2025-09-10 07:05:55] iteration 8873/ 11920 | consumed samples: 9085952 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824195E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:39.029204 | finish at 2025-09-10 11:51:34 + [2025-09-10 07:06:01] iteration 8874/ 11920 | consumed samples: 9086976 | elapsed time per iteration (ms): 6093.6 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.49% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833211E+00 | loss scale: 
1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:09:20.956104 | finish at 2025-09-10 12:15:22 + [2025-09-10 07:06:07] iteration 8875/ 11920 | consumed samples: 9088000 | elapsed time per iteration (ms): 5939.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809835E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:25.782866 | finish at 2025-09-10 12:07:33 + [2025-09-10 07:06:13] iteration 8876/ 11920 | consumed samples: 9089024 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841983E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:58.205647 | finish at 2025-09-10 11:51:11 + [2025-09-10 07:06:19] iteration 8877/ 11920 | consumed samples: 9090048 | elapsed time per iteration (ms): 5899.7 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832016E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:59:12.857250 | finish at 2025-09-10 12:05:32 + [2025-09-10 07:06:24] iteration 8878/ 11920 | consumed samples: 9091072 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824191E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:04.845872 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:06:30] iteration 8879/ 11920 | consumed samples: 9092096 | elapsed time per iteration 
(ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832524E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:41.571375 | finish at 2025-09-10 11:51:12 + [2025-09-10 07:06:36] iteration 8880/ 11920 | consumed samples: 9093120 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842158E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:57.276955 | finish at 2025-09-10 11:51:33 + [2025-09-10 07:06:41] iteration 8881/ 11920 | consumed samples: 9094144 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830418E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:47.810540 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:06:47] iteration 8882/ 11920 | consumed samples: 9095168 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826819E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:42.047183 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:06:53] iteration 8883/ 11920 | consumed samples: 9096192 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826049E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:44:33.777887 | finish at 2025-09-10 11:51:26 + [2025-09-10 07:06:58] iteration 8884/ 11920 | consumed samples: 9097216 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809570E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:10.319575 | finish at 2025-09-10 11:52:08 + [2025-09-10 07:07:04] iteration 8885/ 11920 | consumed samples: 9098240 | elapsed time per iteration (ms): 5993.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831124E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:03:11.286970 | finish at 2025-09-10 12:10:15 + [2025-09-10 07:07:10] iteration 8886/ 11920 | consumed samples: 9099264 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814084E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:00.713876 | finish at 2025-09-10 11:51:10 + [2025-09-10 07:07:15] iteration 8887/ 11920 | consumed samples: 9100288 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837791E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:02.805789 | finish at 2025-09-10 11:51:18 + [2025-09-10 07:07:21] iteration 8888/ 11920 | consumed samples: 9101312 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.823107E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:03.230707 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:07:27] iteration 8889/ 11920 | consumed samples: 9102336 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820206E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:03.794722 | finish at 2025-09-10 11:51:30 + [2025-09-10 07:07:32] iteration 8890/ 11920 | consumed samples: 9103360 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826415E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:50.406396 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:07:38] iteration 8891/ 11920 | consumed samples: 9104384 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822115E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:36.306082 | finish at 2025-09-10 11:51:14 + [2025-09-10 07:07:43] iteration 8892/ 11920 | consumed samples: 9105408 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816778E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:53.553298 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:07:49] iteration 8893/ 11920 | 
consumed samples: 9106432 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822608E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:35.703913 | finish at 2025-09-10 11:51:25 + [2025-09-10 07:07:55] iteration 8894/ 11920 | consumed samples: 9107456 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829756E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:58.388876 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:08:01] iteration 8895/ 11920 | consumed samples: 9108480 | elapsed time per iteration (ms): 5852.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839923E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:04.467249 | finish at 2025-09-10 12:03:05 + [2025-09-10 07:08:06] iteration 8896/ 11920 | consumed samples: 9109504 | elapsed time per iteration (ms): 5878.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824968E+00 | loss scale: 1.0 | grad norm: 0.272 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:56:16.526997 | finish at 2025-09-10 12:04:23 + [2025-09-10 07:08:12] iteration 8897/ 11920 | consumed samples: 9110528 | elapsed time per iteration (ms): 5995.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818855E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 4.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:02:05.201831 | finish at 2025-09-10 12:10:18 + [2025-09-10 07:08:18] iteration 8898/ 11920 | consumed samples: 9111552 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832138E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:06.108811 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:08:24] iteration 8899/ 11920 | consumed samples: 9112576 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818247E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:13.317310 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:08:29] iteration 8900/ 11920 | consumed samples: 9113600 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839068E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:42:55.244470 | finish at 2025-09-10 11:51:25 + [2025-09-10 07:08:35] iteration 8901/ 11920 | consumed samples: 9114624 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838437E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:07.582901 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:08:41] iteration 8902/ 11920 | consumed samples: 9115648 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 
80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844109E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:22.642274 | finish at 2025-09-10 11:52:03 + [2025-09-10 07:08:47] iteration 8903/ 11920 | consumed samples: 9116672 | elapsed time per iteration (ms): 5944.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832552E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:58:55.843251 | finish at 2025-09-10 12:07:42 + [2025-09-10 07:08:52] iteration 8904/ 11920 | consumed samples: 9117696 | elapsed time per iteration (ms): 5832.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810572E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:10.018179 | finish at 2025-09-10 12:02:02 + [2025-09-10 07:08:59] iteration 8905/ 11920 | consumed samples: 9118720 | elapsed time per iteration (ms): 6235.2 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825695E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:13:19.112688 | finish at 2025-09-10 12:22:18 + [2025-09-10 07:09:04] iteration 8906/ 11920 | consumed samples: 9119744 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831728E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:03.337373 | finish at 2025-09-10 
11:52:08 + [2025-09-10 07:09:10] iteration 8907/ 11920 | consumed samples: 9120768 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830669E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:42:13.934612 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:09:15] iteration 8908/ 11920 | consumed samples: 9121792 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828914E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:42:06.870907 | finish at 2025-09-10 11:51:22 + [2025-09-10 07:09:21] iteration 8909/ 11920 | consumed samples: 9122816 | elapsed time per iteration (ms): 5615.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823065E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:47.495829 | finish at 2025-09-10 11:51:09 + [2025-09-10 07:09:27] iteration 8910/ 11920 | consumed samples: 9123840 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817706E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:42:05.516057 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:09:32] iteration 8911/ 11920 | consumed samples: 9124864 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838901E+00 | loss scale: 
1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:42:12.339161 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:09:38] iteration 8912/ 11920 | consumed samples: 9125888 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825858E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:30.132309 | finish at 2025-09-10 11:51:08 + [2025-09-10 07:09:44] iteration 8913/ 11920 | consumed samples: 9126912 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827672E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:33.496720 | finish at 2025-09-10 11:51:17 + [2025-09-10 07:09:49] iteration 8914/ 11920 | consumed samples: 9127936 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806036E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:37.498743 | finish at 2025-09-10 11:51:27 + [2025-09-10 07:09:55] iteration 8915/ 11920 | consumed samples: 9128960 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829135E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:37.667100 | finish at 2025-09-10 11:51:33 + [2025-09-10 07:10:01] iteration 8916/ 11920 | consumed samples: 9129984 | elapsed time per 
iteration (ms): 5905.2 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830470E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:39.280975 | finish at 2025-09-10 12:05:40 + [2025-09-10 07:10:06] iteration 8917/ 11920 | consumed samples: 9131008 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827758E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:30.851877 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:10:12] iteration 8918/ 11920 | consumed samples: 9132032 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822622E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:32.713065 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:10:18] iteration 8919/ 11920 | consumed samples: 9133056 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826023E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:17.378088 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:10:23] iteration 8920/ 11920 | consumed samples: 9134080 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822912E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | remaining time: 4:41:00.473156 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:10:29] iteration 8921/ 11920 | consumed samples: 9135104 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835820E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:07.828418 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:10:34] iteration 8922/ 11920 | consumed samples: 9136128 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822185E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:26.814486 | finish at 2025-09-10 11:52:01 + [2025-09-10 07:10:40] iteration 8923/ 11920 | consumed samples: 9137152 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823613E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:33.249702 | finish at 2025-09-10 11:51:13 + [2025-09-10 07:10:46] iteration 8924/ 11920 | consumed samples: 9138176 | elapsed time per iteration (ms): 5958.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816169E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:57:30.379268 | finish at 2025-09-10 12:08:16 + [2025-09-10 07:10:52] iteration 8925/ 11920 | consumed samples: 9139200 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.806015E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:45.901731 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:10:57] iteration 8926/ 11920 | consumed samples: 9140224 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821594E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:33.848346 | finish at 2025-09-10 11:51:31 + [2025-09-10 07:11:03] iteration 8927/ 11920 | consumed samples: 9141248 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822538E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:31.955737 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:11:09] iteration 8928/ 11920 | consumed samples: 9142272 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822310E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:26.067310 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:11:14] iteration 8929/ 11920 | consumed samples: 9143296 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816622E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:13.818120 | finish at 2025-09-10 11:51:28 + [2025-09-10 07:11:20] iteration 8930/ 
11920 | consumed samples: 9144320 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819747E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:24.514279 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:11:26] iteration 8931/ 11920 | consumed samples: 9145344 | elapsed time per iteration (ms): 5823.2 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812985E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:50:05.600348 | finish at 2025-09-10 12:01:31 + [2025-09-10 07:11:32] iteration 8932/ 11920 | consumed samples: 9146368 | elapsed time per iteration (ms): 5915.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831573E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:54:36.472661 | finish at 2025-09-10 12:06:08 + [2025-09-10 07:11:37] iteration 8933/ 11920 | consumed samples: 9147392 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830126E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:27.420046 | finish at 2025-09-10 11:52:05 + [2025-09-10 07:11:43] iteration 8934/ 11920 | consumed samples: 9148416 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835292E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:47.250646 | finish at 2025-09-10 11:51:30 + [2025-09-10 07:11:48] iteration 8935/ 11920 | consumed samples: 9149440 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844497E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:41.958879 | finish at 2025-09-10 11:51:30 + [2025-09-10 07:11:54] iteration 8936/ 11920 | consumed samples: 9150464 | elapsed time per iteration (ms): 5857.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811634E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:51:19.781937 | finish at 2025-09-10 12:03:14 + [2025-09-10 07:12:01] iteration 8937/ 11920 | consumed samples: 9151488 | elapsed time per iteration (ms): 6294.4 | throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829548E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:12:56.195131 | finish at 2025-09-10 12:24:57 + [2025-09-10 07:12:06] iteration 8938/ 11920 | consumed samples: 9152512 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816591E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:27.949954 | finish at 2025-09-10 11:51:34 + [2025-09-10 07:12:12] iteration 8939/ 11920 | consumed samples: 9153536 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811956E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:23.620420 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:12:17] iteration 8940/ 11920 | consumed samples: 9154560 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830093E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:57.361536 | finish at 2025-09-10 11:51:15 + [2025-09-10 07:12:23] iteration 8941/ 11920 | consumed samples: 9155584 | elapsed time per iteration (ms): 5814.9 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836689E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:42.518682 | finish at 2025-09-10 12:01:06 + [2025-09-10 07:12:29] iteration 8942/ 11920 | consumed samples: 9156608 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826960E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:50.734246 | finish at 2025-09-10 11:51:20 + [2025-09-10 07:12:35] iteration 8943/ 11920 | consumed samples: 9157632 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814230E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:10.795690 | finish at 2025-09-10 
11:51:45 + [2025-09-10 07:12:40] iteration 8944/ 11920 | consumed samples: 9158656 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826713E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:34.816521 | finish at 2025-09-10 11:51:15 + [2025-09-10 07:12:46] iteration 8945/ 11920 | consumed samples: 9159680 | elapsed time per iteration (ms): 5929.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833090E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:59.413780 | finish at 2025-09-10 12:06:45 + [2025-09-10 07:12:52] iteration 8946/ 11920 | consumed samples: 9160704 | elapsed time per iteration (ms): 5843.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820468E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:37.376066 | finish at 2025-09-10 12:02:29 + [2025-09-10 07:12:58] iteration 8947/ 11920 | consumed samples: 9161728 | elapsed time per iteration (ms): 5644.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818800E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:40.690272 | finish at 2025-09-10 11:52:38 + [2025-09-10 07:13:03] iteration 8948/ 11920 | consumed samples: 9162752 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813239E+00 | loss scale: 
1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:58.878568 | finish at 2025-09-10 11:52:02 + [2025-09-10 07:13:09] iteration 8949/ 11920 | consumed samples: 9163776 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814059E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:19.857985 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:13:14] iteration 8950/ 11920 | consumed samples: 9164800 | elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819909E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:55.445387 | finish at 2025-09-10 11:51:10 + [2025-09-10 07:13:20] iteration 8951/ 11920 | consumed samples: 9165824 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817993E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:09.333141 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:13:26] iteration 8952/ 11920 | consumed samples: 9166848 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834478E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:52.764257 | finish at 2025-09-10 11:51:18 + [2025-09-10 07:13:31] iteration 8953/ 11920 | consumed samples: 9167872 | elapsed time per iteration 
(ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823313E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:12.314904 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:13:37] iteration 8954/ 11920 | consumed samples: 9168896 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816001E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:38:04.820624 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:13:43] iteration 8955/ 11920 | consumed samples: 9169920 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827938E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:40.420412 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:13:48] iteration 8956/ 11920 | consumed samples: 9170944 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818087E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:43.977528 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:13:54] iteration 8957/ 11920 | consumed samples: 9171968 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830189E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:37:43.410646 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:14:00] iteration 8958/ 11920 | consumed samples: 9172992 | elapsed time per iteration (ms): 5847.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836499E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:40.325621 | finish at 2025-09-10 12:02:40 + [2025-09-10 07:14:05] iteration 8959/ 11920 | consumed samples: 9174016 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831981E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:21.682341 | finish at 2025-09-10 11:51:27 + [2025-09-10 07:14:11] iteration 8960/ 11920 | consumed samples: 9175040 | elapsed time per iteration (ms): 5981.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832780E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:04.832535 | finish at 2025-09-10 12:09:16 + [2025-09-10 07:14:17] iteration 8961/ 11920 | consumed samples: 9176064 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821763E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:05.094921 | finish at 2025-09-10 11:51:22 + [2025-09-10 07:14:22] iteration 8962/ 11920 | consumed samples: 9177088 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.819164E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:58.283873 | finish at 2025-09-10 11:51:21 + [2025-09-10 07:14:28] iteration 8963/ 11920 | consumed samples: 9178112 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821294E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:50.900462 | finish at 2025-09-10 11:51:19 + [2025-09-10 07:14:34] iteration 8964/ 11920 | consumed samples: 9179136 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826306E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:16.682387 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:14:39] iteration 8965/ 11920 | consumed samples: 9180160 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830326E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:56.207081 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:14:45] iteration 8966/ 11920 | consumed samples: 9181184 | elapsed time per iteration (ms): 5847.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825651E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:53.978675 | finish at 2025-09-10 12:02:39 + [2025-09-10 07:14:51] iteration 8967/ 11920 | 
consumed samples: 9182208 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838321E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:50.382807 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:14:56] iteration 8968/ 11920 | consumed samples: 9183232 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837868E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:35.473206 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:15:02] iteration 8969/ 11920 | consumed samples: 9184256 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825325E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:24.616869 | finish at 2025-09-10 11:52:27 + [2025-09-10 07:15:08] iteration 8970/ 11920 | consumed samples: 9185280 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821391E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:45.977476 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:15:14] iteration 8971/ 11920 | consumed samples: 9186304 | elapsed time per iteration (ms): 5966.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828130E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:15.215660 | finish at 2025-09-10 12:08:29 + [2025-09-10 07:15:19] iteration 8972/ 11920 | consumed samples: 9187328 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838516E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:05.175254 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:15:25] iteration 8973/ 11920 | consumed samples: 9188352 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836892E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:06.882345 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:15:31] iteration 8974/ 11920 | consumed samples: 9189376 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826711E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:14.964894 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:15:36] iteration 8975/ 11920 | consumed samples: 9190400 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835149E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:13.026989 | finish at 2025-09-10 11:51:49 + [2025-09-10 07:15:42] iteration 8976/ 11920 | consumed samples: 9191424 | elapsed time per iteration (ms): 5981.3 | throughput per GPU (TFLOP/s/GPU): 
75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815036E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:53:29.085815 | finish at 2025-09-10 12:09:11 + [2025-09-10 07:15:48] iteration 8977/ 11920 | consumed samples: 9192448 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831822E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:49.301254 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:15:53] iteration 8978/ 11920 | consumed samples: 9193472 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818689E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:49.550329 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:15:59] iteration 8979/ 11920 | consumed samples: 9194496 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820583E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:54.761933 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:16:05] iteration 8980/ 11920 | consumed samples: 9195520 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822841E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:47.288775 | finish at 2025-09-10 
11:51:52 + [2025-09-10 07:16:10] iteration 8981/ 11920 | consumed samples: 9196544 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830008E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:06.933149 | finish at 2025-09-10 11:51:17 + [2025-09-10 07:16:16] iteration 8982/ 11920 | consumed samples: 9197568 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838662E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:08.090217 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:16:21] iteration 8983/ 11920 | consumed samples: 9198592 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815919E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:16.261133 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:16:27] iteration 8984/ 11920 | consumed samples: 9199616 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838682E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:57.085678 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:16:33] iteration 8985/ 11920 | consumed samples: 9200640 | elapsed time per iteration (ms): 6161.1 | throughput per GPU (TFLOP/s/GPU): 73.3 | MFU 7.41% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816429E+00 | loss scale: 
1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 5:01:22.702981 | finish at 2025-09-10 12:17:56 + [2025-09-10 07:16:39] iteration 8986/ 11920 | consumed samples: 9201664 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828125E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:51.840670 | finish at 2025-09-10 11:51:31 + [2025-09-10 07:16:45] iteration 8987/ 11920 | consumed samples: 9202688 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817022E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:15.678369 | finish at 2025-09-10 11:52:00 + [2025-09-10 07:16:50] iteration 8988/ 11920 | consumed samples: 9203712 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814009E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:57.424760 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:16:56] iteration 8989/ 11920 | consumed samples: 9204736 | elapsed time per iteration (ms): 5916.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840711E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:01.134586 | finish at 2025-09-10 12:05:57 + [2025-09-10 07:17:02] iteration 8990/ 11920 | consumed samples: 9205760 | elapsed time per iteration 
(ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818171E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:50.818264 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:17:08] iteration 8991/ 11920 | consumed samples: 9206784 | elapsed time per iteration (ms): 5993.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839840E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:52:34.054355 | finish at 2025-09-10 12:09:42 + [2025-09-10 07:17:13] iteration 8992/ 11920 | consumed samples: 9207808 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817311E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:13.860867 | finish at 2025-09-10 11:51:27 + [2025-09-10 07:17:19] iteration 8993/ 11920 | consumed samples: 9208832 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819587E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:08.293020 | finish at 2025-09-10 11:51:27 + [2025-09-10 07:17:25] iteration 8994/ 11920 | consumed samples: 9209856 | elapsed time per iteration (ms): 5889.6 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817745E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:47:13.059007 | finish at 2025-09-10 12:04:38 + [2025-09-10 07:17:30] iteration 8995/ 11920 | consumed samples: 9210880 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837017E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:52.067782 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:17:36] iteration 8996/ 11920 | consumed samples: 9211904 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829185E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:16.073378 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:17:42] iteration 8997/ 11920 | consumed samples: 9212928 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824408E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:53.823743 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:17:47] iteration 8998/ 11920 | consumed samples: 9213952 | elapsed time per iteration (ms): 5615.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821333E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:29.197334 | finish at 2025-09-10 11:51:17 + [2025-09-10 07:17:53] iteration 8999/ 11920 | consumed samples: 9214976 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.825896E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:56.862840 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:17:59] iteration 9000/ 11920 | consumed samples: 9216000 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813795E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:47.195759 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:18:04] iteration 9001/ 11920 | consumed samples: 9217024 | elapsed time per iteration (ms): 5863.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828956E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:16.246526 | finish at 2025-09-10 12:03:21 + [2025-09-10 07:18:10] iteration 9002/ 11920 | consumed samples: 9218048 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826933E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:09.527347 | finish at 2025-09-10 11:52:20 + [2025-09-10 07:18:16] iteration 9003/ 11920 | consumed samples: 9219072 | elapsed time per iteration (ms): 5640.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828092E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:14.085632 | finish at 2025-09-10 11:52:30 + [2025-09-10 07:18:21] iteration 9004/ 11920 | 
consumed samples: 9220096 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825689E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:51.564031 | finish at 2025-09-10 11:52:13 + [2025-09-10 07:18:27] iteration 9005/ 11920 | consumed samples: 9221120 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815284E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:06.596681 | finish at 2025-09-10 11:52:34 + [2025-09-10 07:18:33] iteration 9006/ 11920 | consumed samples: 9222144 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821286E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:47.746004 | finish at 2025-09-10 11:52:20 + [2025-09-10 07:18:38] iteration 9007/ 11920 | consumed samples: 9223168 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818706E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:01.107885 | finish at 2025-09-10 11:51:39 + [2025-09-10 07:18:44] iteration 9008/ 11920 | consumed samples: 9224192 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820811E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:10.850128 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:18:49] iteration 9009/ 11920 | consumed samples: 9225216 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821716E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:32:41.919821 | finish at 2025-09-10 11:51:31 + [2025-09-10 07:18:55] iteration 9010/ 11920 | consumed samples: 9226240 | elapsed time per iteration (ms): 5936.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826239E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:47:54.282746 | finish at 2025-09-10 12:06:50 + [2025-09-10 07:19:01] iteration 9011/ 11920 | consumed samples: 9227264 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819623E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:32:39.998433 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:19:07] iteration 9012/ 11920 | consumed samples: 9228288 | elapsed time per iteration (ms): 5974.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818300E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:49:34.172382 | finish at 2025-09-10 12:08:41 + [2025-09-10 07:19:13] iteration 9013/ 11920 | consumed samples: 9229312 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815277E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:32:22.460856 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:19:18] iteration 9014/ 11920 | consumed samples: 9230336 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817297E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:32:13.845314 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:19:24] iteration 9015/ 11920 | consumed samples: 9231360 | elapsed time per iteration (ms): 5921.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832735E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:46:42.320331 | finish at 2025-09-10 12:06:07 + [2025-09-10 07:19:30] iteration 9016/ 11920 | consumed samples: 9232384 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832261E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:59.045082 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:19:35] iteration 9017/ 11920 | consumed samples: 9233408 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814611E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:51.055726 | finish at 2025-09-10 
11:51:26 + [2025-09-10 07:19:41] iteration 9018/ 11920 | consumed samples: 9234432 | elapsed time per iteration (ms): 5955.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833951E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:48:01.689994 | finish at 2025-09-10 12:07:43 + [2025-09-10 07:19:47] iteration 9019/ 11920 | consumed samples: 9235456 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839947E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:41.135945 | finish at 2025-09-10 11:51:28 + [2025-09-10 07:19:53] iteration 9020/ 11920 | consumed samples: 9236480 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829210E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:50.777688 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:19:58] iteration 9021/ 11920 | consumed samples: 9237504 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814774E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:46.745059 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:20:04] iteration 9022/ 11920 | consumed samples: 9238528 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813038E+00 | loss scale: 
1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:44.271468 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:20:10] iteration 9023/ 11920 | consumed samples: 9239552 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826341E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:36.750148 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:20:15] iteration 9024/ 11920 | consumed samples: 9240576 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827037E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:31.105427 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:20:21] iteration 9025/ 11920 | consumed samples: 9241600 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820827E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:11.326357 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:20:26] iteration 9026/ 11920 | consumed samples: 9242624 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834150E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:29.049382 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:20:32] iteration 9027/ 11920 | consumed samples: 9243648 | elapsed time per iteration 
(ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827657E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:50.865774 | finish at 2025-09-10 11:52:23 + [2025-09-10 07:20:38] iteration 9028/ 11920 | consumed samples: 9244672 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826439E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:16.556668 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:20:44] iteration 9029/ 11920 | consumed samples: 9245696 | elapsed time per iteration (ms): 5926.8 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819143E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:34.320249 | finish at 2025-09-10 12:06:18 + [2025-09-10 07:20:49] iteration 9030/ 11920 | consumed samples: 9246720 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817855E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:47.692940 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:20:55] iteration 9031/ 11920 | consumed samples: 9247744 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817091E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:31:19.748473 | finish at 2025-09-10 11:52:15 + [2025-09-10 07:21:00] iteration 9032/ 11920 | consumed samples: 9248768 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820203E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:40.131243 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:21:06] iteration 9033/ 11920 | consumed samples: 9249792 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815781E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:39.857508 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:21:12] iteration 9034/ 11920 | consumed samples: 9250816 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805161E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:31.482790 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:21:18] iteration 9035/ 11920 | consumed samples: 9251840 | elapsed time per iteration (ms): 5836.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811045E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:39.455112 | finish at 2025-09-10 12:01:57 + [2025-09-10 07:21:23] iteration 9036/ 11920 | consumed samples: 9252864 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.821132E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:17.194485 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:21:29] iteration 9037/ 11920 | consumed samples: 9253888 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829044E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:04.768515 | finish at 2025-09-10 11:51:34 + [2025-09-10 07:21:34] iteration 9038/ 11920 | consumed samples: 9254912 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835095E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:16.033059 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:21:40] iteration 9039/ 11920 | consumed samples: 9255936 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809355E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:57.228531 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:21:46] iteration 9040/ 11920 | consumed samples: 9256960 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818501E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:50.419922 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:21:51] iteration 9041/ 11920 | 
consumed samples: 9257984 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821462E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:45.808640 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:21:57] iteration 9042/ 11920 | consumed samples: 9259008 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811394E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:35.963247 | finish at 2025-09-10 11:51:33 + [2025-09-10 07:22:03] iteration 9043/ 11920 | consumed samples: 9260032 | elapsed time per iteration (ms): 5635.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816010E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:11.916922 | finish at 2025-09-10 11:52:14 + [2025-09-10 07:22:08] iteration 9044/ 11920 | consumed samples: 9261056 | elapsed time per iteration (ms): 5945.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812918E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:59.617006 | finish at 2025-09-10 12:07:08 + [2025-09-10 07:22:14] iteration 9045/ 11920 | consumed samples: 9262080 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833459E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:07.114366 | finish at 2025-09-10 11:51:21 + [2025-09-10 07:22:20] iteration 9046/ 11920 | consumed samples: 9263104 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829939E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:03.652295 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:22:26] iteration 9047/ 11920 | consumed samples: 9264128 | elapsed time per iteration (ms): 5848.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819975E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:02.875479 | finish at 2025-09-10 12:02:28 + [2025-09-10 07:22:31] iteration 9048/ 11920 | consumed samples: 9265152 | elapsed time per iteration (ms): 5849.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830880E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:00.719725 | finish at 2025-09-10 12:02:32 + [2025-09-10 07:22:37] iteration 9049/ 11920 | consumed samples: 9266176 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807187E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:03.215187 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:22:43] iteration 9050/ 11920 | consumed samples: 9267200 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 
80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818870E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:45.007398 | finish at 2025-09-10 11:51:28 + [2025-09-10 07:22:48] iteration 9051/ 11920 | consumed samples: 9268224 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831966E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:31.881777 | finish at 2025-09-10 11:51:20 + [2025-09-10 07:22:54] iteration 9052/ 11920 | consumed samples: 9269248 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811822E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:20.622688 | finish at 2025-09-10 11:52:15 + [2025-09-10 07:23:00] iteration 9053/ 11920 | consumed samples: 9270272 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828501E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:44.699273 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:23:05] iteration 9054/ 11920 | consumed samples: 9271296 | elapsed time per iteration (ms): 5850.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812433E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:28.065603 | finish at 2025-09-10 
12:02:33 + [2025-09-10 07:23:12] iteration 9055/ 11920 | consumed samples: 9272320 | elapsed time per iteration (ms): 6198.4 | throughput per GPU (TFLOP/s/GPU): 72.8 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821576E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:55:58.498710 | finish at 2025-09-10 12:19:10 + [2025-09-10 07:23:18] iteration 9056/ 11920 | consumed samples: 9273344 | elapsed time per iteration (ms): 5976.6 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826173E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:16.956257 | finish at 2025-09-10 12:08:35 + [2025-09-10 07:23:23] iteration 9057/ 11920 | consumed samples: 9274368 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824167E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:10.370934 | finish at 2025-09-10 11:51:34 + [2025-09-10 07:23:29] iteration 9058/ 11920 | consumed samples: 9275392 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823451E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:05.778451 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:23:34] iteration 9059/ 11920 | consumed samples: 9276416 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818883E+00 | loss scale: 
1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:25.096128 | finish at 2025-09-10 11:52:00 + [2025-09-10 07:23:40] iteration 9060/ 11920 | consumed samples: 9277440 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813479E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:59.138141 | finish at 2025-09-10 11:51:39 + [2025-09-10 07:23:46] iteration 9061/ 11920 | consumed samples: 9278464 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832689E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:37.349639 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:23:51] iteration 9062/ 11920 | consumed samples: 9279488 | elapsed time per iteration (ms): 5613.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827495E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:22.707389 | finish at 2025-09-10 11:51:14 + [2025-09-10 07:23:57] iteration 9063/ 11920 | consumed samples: 9280512 | elapsed time per iteration (ms): 5974.1 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820980E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:44:28.046255 | finish at 2025-09-10 12:08:25 + [2025-09-10 07:24:03] iteration 9064/ 11920 | consumed samples: 9281536 | elapsed time per iteration 
(ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833024E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:37.094479 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:24:09] iteration 9065/ 11920 | consumed samples: 9282560 | elapsed time per iteration (ms): 5836.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827671E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:43.693988 | finish at 2025-09-10 12:01:52 + [2025-09-10 07:24:14] iteration 9066/ 11920 | consumed samples: 9283584 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837161E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:46.264094 | finish at 2025-09-10 11:52:01 + [2025-09-10 07:24:20] iteration 9067/ 11920 | consumed samples: 9284608 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810638E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:35.837201 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:24:26] iteration 9068/ 11920 | consumed samples: 9285632 | elapsed time per iteration (ms): 5891.5 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815062E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:40:02.498152 | finish at 2025-09-10 12:04:28 + [2025-09-10 07:24:31] iteration 9069/ 11920 | consumed samples: 9286656 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818417E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:08.572761 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:24:37] iteration 9070/ 11920 | consumed samples: 9287680 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817649E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:26.759427 | finish at 2025-09-10 11:52:04 + [2025-09-10 07:24:43] iteration 9071/ 11920 | consumed samples: 9288704 | elapsed time per iteration (ms): 5971.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821589E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:33.184334 | finish at 2025-09-10 12:08:16 + [2025-09-10 07:24:49] iteration 9072/ 11920 | consumed samples: 9289728 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813338E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:46.079491 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:24:54] iteration 9073/ 11920 | consumed samples: 9290752 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.818387E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:59.890069 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:25:00] iteration 9074/ 11920 | consumed samples: 9291776 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834056E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:48.467728 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:25:06] iteration 9075/ 11920 | consumed samples: 9292800 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820027E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:38.359258 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:25:11] iteration 9076/ 11920 | consumed samples: 9293824 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831452E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:19.364542 | finish at 2025-09-10 11:51:31 + [2025-09-10 07:25:17] iteration 9077/ 11920 | consumed samples: 9294848 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816436E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:14.542666 | finish at 2025-09-10 11:52:31 + [2025-09-10 07:25:22] iteration 9078/ 11920 | 
consumed samples: 9295872 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830578E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:21.155234 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:25:28] iteration 9079/ 11920 | consumed samples: 9296896 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832784E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:16.510793 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:25:34] iteration 9080/ 11920 | consumed samples: 9297920 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815657E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:16.448336 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:25:39] iteration 9081/ 11920 | consumed samples: 9298944 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815790E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:55.635886 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:25:45] iteration 9082/ 11920 | consumed samples: 9299968 | elapsed time per iteration (ms): 5868.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825867E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 8.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:37:35.321960 | finish at 2025-09-10 12:03:21 + [2025-09-10 07:25:51] iteration 9083/ 11920 | consumed samples: 9300992 | elapsed time per iteration (ms): 5959.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819783E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:45.683314 | finish at 2025-09-10 12:07:37 + [2025-09-10 07:25:57] iteration 9084/ 11920 | consumed samples: 9302016 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808344E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:45.066351 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:26:02] iteration 9085/ 11920 | consumed samples: 9303040 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814776E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:25.700558 | finish at 2025-09-10 11:51:28 + [2025-09-10 07:26:08] iteration 9086/ 11920 | consumed samples: 9304064 | elapsed time per iteration (ms): 5817.0 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814512E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:45.257481 | finish at 2025-09-10 12:00:53 + [2025-09-10 07:26:14] iteration 9087/ 11920 | consumed samples: 9305088 | elapsed time per iteration (ms): 5937.2 | throughput per GPU (TFLOP/s/GPU): 
76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808790E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:40:20.105358 | finish at 2025-09-10 12:06:34 + [2025-09-10 07:26:20] iteration 9088/ 11920 | consumed samples: 9306112 | elapsed time per iteration (ms): 5980.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818059E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:42:17.897461 | finish at 2025-09-10 12:08:38 + [2025-09-10 07:26:26] iteration 9089/ 11920 | consumed samples: 9307136 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808912E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:25.110035 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:26:31] iteration 9090/ 11920 | consumed samples: 9308160 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812696E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:19.898381 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:26:37] iteration 9091/ 11920 | consumed samples: 9309184 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821222E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:15.295496 | finish at 2025-09-10 
11:51:52 + [2025-09-10 07:26:43] iteration 9092/ 11920 | consumed samples: 9310208 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823946E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:59.216866 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:26:48] iteration 9093/ 11920 | consumed samples: 9311232 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822578E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:42.423091 | finish at 2025-09-10 11:51:31 + [2025-09-10 07:26:54] iteration 9094/ 11920 | consumed samples: 9312256 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817509E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:48.761710 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:27:00] iteration 9095/ 11920 | consumed samples: 9313280 | elapsed time per iteration (ms): 5829.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815255E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:28.257236 | finish at 2025-09-10 12:01:28 + [2025-09-10 07:27:05] iteration 9096/ 11920 | consumed samples: 9314304 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830911E+00 | loss scale: 
1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:22.950293 | finish at 2025-09-10 11:51:28 + [2025-09-10 07:27:11] iteration 9097/ 11920 | consumed samples: 9315328 | elapsed time per iteration (ms): 5979.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825542E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:41:21.068242 | finish at 2025-09-10 12:08:32 + [2025-09-10 07:27:17] iteration 9098/ 11920 | consumed samples: 9316352 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822155E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:40.817272 | finish at 2025-09-10 11:51:58 + [2025-09-10 07:27:23] iteration 9099/ 11920 | consumed samples: 9317376 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825249E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:31.508071 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:27:28] iteration 9100/ 11920 | consumed samples: 9318400 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823021E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:18.231311 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:27:34] iteration 9101/ 11920 | consumed samples: 9319424 | elapsed time per iteration 
(ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822687E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:34.112399 | finish at 2025-09-10 11:52:08 + [2025-09-10 07:27:39] iteration 9102/ 11920 | consumed samples: 9320448 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820420E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:08.141959 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:27:45] iteration 9103/ 11920 | consumed samples: 9321472 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835276E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:59.026955 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:27:51] iteration 9104/ 11920 | consumed samples: 9322496 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828102E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:45.802856 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:27:56] iteration 9105/ 11920 | consumed samples: 9323520 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824605E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:23:47.557476 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:28:02] iteration 9106/ 11920 | consumed samples: 9324544 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828919E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:33.618299 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:28:08] iteration 9107/ 11920 | consumed samples: 9325568 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816934E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:30.517718 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:28:13] iteration 9108/ 11920 | consumed samples: 9326592 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827971E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:48.375094 | finish at 2025-09-10 11:52:02 + [2025-09-10 07:28:19] iteration 9109/ 11920 | consumed samples: 9327616 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818184E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:51.419216 | finish at 2025-09-10 11:52:10 + [2025-09-10 07:28:24] iteration 9110/ 11920 | consumed samples: 9328640 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.826102E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:13.475945 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:28:30] iteration 9111/ 11920 | consumed samples: 9329664 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827749E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:15.375752 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:28:36] iteration 9112/ 11920 | consumed samples: 9330688 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830631E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:29.875832 | finish at 2025-09-10 11:52:06 + [2025-09-10 07:28:42] iteration 9113/ 11920 | consumed samples: 9331712 | elapsed time per iteration (ms): 5834.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813252E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:32:57.938955 | finish at 2025-09-10 12:01:39 + [2025-09-10 07:28:47] iteration 9114/ 11920 | consumed samples: 9332736 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812971E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:11.531837 | finish at 2025-09-10 11:51:59 + [2025-09-10 07:28:53] iteration 9115/ 11920 | 
consumed samples: 9333760 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832423E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:11.645404 | finish at 2025-09-10 11:52:04 + [2025-09-10 07:28:58] iteration 9116/ 11920 | consumed samples: 9334784 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817356E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:51.121500 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:29:04] iteration 9117/ 11920 | consumed samples: 9335808 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822977E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:49.813459 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:29:10] iteration 9118/ 11920 | consumed samples: 9336832 | elapsed time per iteration (ms): 5615.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824532E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:13.831939 | finish at 2025-09-10 11:51:23 + [2025-09-10 07:29:15] iteration 9119/ 11920 | consumed samples: 9337856 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811948E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:31.368376 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:29:21] iteration 9120/ 11920 | consumed samples: 9338880 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830126E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:08.614330 | finish at 2025-09-10 11:51:29 + [2025-09-10 07:29:27] iteration 9121/ 11920 | consumed samples: 9339904 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834597E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:11.382682 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:29:32] iteration 9122/ 11920 | consumed samples: 9340928 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819217E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:19.845370 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:29:38] iteration 9123/ 11920 | consumed samples: 9341952 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838322E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:22:31.162142 | finish at 2025-09-10 11:52:09 + [2025-09-10 07:29:43] iteration 9124/ 11920 | consumed samples: 9342976 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 
80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809552E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:48.746693 | finish at 2025-09-10 11:51:32 + [2025-09-10 07:29:49] iteration 9125/ 11920 | consumed samples: 9344000 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829288E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:49.727560 | finish at 2025-09-10 11:51:39 + [2025-09-10 07:29:55] iteration 9126/ 11920 | consumed samples: 9345024 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816975E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:31.446887 | finish at 2025-09-10 11:51:26 + [2025-09-10 07:30:00] iteration 9127/ 11920 | consumed samples: 9346048 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813200E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:42.090786 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:30:06] iteration 9128/ 11920 | consumed samples: 9347072 | elapsed time per iteration (ms): 6089.5 | throughput per GPU (TFLOP/s/GPU): 74.1 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830712E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:43:21.953756 | finish at 2025-09-10 
12:13:28 + [2025-09-10 07:30:12] iteration 9129/ 11920 | consumed samples: 9348096 | elapsed time per iteration (ms): 5614.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823221E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:10.100903 | finish at 2025-09-10 11:51:22 + [2025-09-10 07:30:18] iteration 9130/ 11920 | consumed samples: 9349120 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827826E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:32.338471 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:30:23] iteration 9131/ 11920 | consumed samples: 9350144 | elapsed time per iteration (ms): 5916.6 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830640E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:35:01.374761 | finish at 2025-09-10 12:05:25 + [2025-09-10 07:30:29] iteration 9132/ 11920 | consumed samples: 9351168 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814976E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:07.898958 | finish at 2025-09-10 11:51:37 + [2025-09-10 07:30:35] iteration 9133/ 11920 | consumed samples: 9352192 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835382E+00 | loss scale: 
1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:19.153476 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:30:40] iteration 9134/ 11920 | consumed samples: 9353216 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819439E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:00.911195 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:30:46] iteration 9135/ 11920 | consumed samples: 9354240 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825529E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:39.928365 | finish at 2025-09-10 11:51:26 + [2025-09-10 07:30:52] iteration 9136/ 11920 | consumed samples: 9355264 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817715E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:50.281929 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:30:57] iteration 9137/ 11920 | consumed samples: 9356288 | elapsed time per iteration (ms): 5861.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820205E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:53.265198 | finish at 2025-09-10 12:02:51 + [2025-09-10 07:31:03] iteration 9138/ 11920 | consumed samples: 9357312 | elapsed time per iteration 
(ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822915E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:20.790074 | finish at 2025-09-10 11:51:24 + [2025-09-10 07:31:09] iteration 9139/ 11920 | consumed samples: 9358336 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832783E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:34.386769 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:31:14] iteration 9140/ 11920 | consumed samples: 9359360 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813115E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:35.585160 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:31:20] iteration 9141/ 11920 | consumed samples: 9360384 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815463E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:42.806664 | finish at 2025-09-10 11:52:03 + [2025-09-10 07:31:26] iteration 9142/ 11920 | consumed samples: 9361408 | elapsed time per iteration (ms): 5925.8 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823725E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:34:21.978609 | finish at 2025-09-10 12:05:48 + [2025-09-10 07:31:31] iteration 9143/ 11920 | consumed samples: 9362432 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832150E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:15.846046 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:31:37] iteration 9144/ 11920 | consumed samples: 9363456 | elapsed time per iteration (ms): 5943.6 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818132E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:59.439388 | finish at 2025-09-10 12:06:37 + [2025-09-10 07:31:43] iteration 9145/ 11920 | consumed samples: 9364480 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819135E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:17.723876 | finish at 2025-09-10 11:52:01 + [2025-09-10 07:31:49] iteration 9146/ 11920 | consumed samples: 9365504 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822651E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:46.594642 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:31:54] iteration 9147/ 11920 | consumed samples: 9366528 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.815716E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:54.897340 | finish at 2025-09-10 11:51:49 + [2025-09-10 07:32:00] iteration 9148/ 11920 | consumed samples: 9367552 | elapsed time per iteration (ms): 5932.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830602E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:04.278354 | finish at 2025-09-10 12:06:05 + [2025-09-10 07:32:06] iteration 9149/ 11920 | consumed samples: 9368576 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825546E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:37.882125 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:32:12] iteration 9150/ 11920 | consumed samples: 9369600 | elapsed time per iteration (ms): 5998.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810790E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:36:56.929710 | finish at 2025-09-10 12:09:09 + [2025-09-10 07:32:17] iteration 9151/ 11920 | consumed samples: 9370624 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820965E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:42.724587 | finish at 2025-09-10 11:52:00 + [2025-09-10 07:32:23] iteration 9152/ 11920 
| consumed samples: 9371648 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808755E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:33.441601 | finish at 2025-09-10 11:51:57 + [2025-09-10 07:32:29] iteration 9153/ 11920 | consumed samples: 9372672 | elapsed time per iteration (ms): 5889.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810398E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:36.072069 | finish at 2025-09-10 12:04:05 + [2025-09-10 07:32:35] iteration 9154/ 11920 | consumed samples: 9373696 | elapsed time per iteration (ms): 5868.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812046E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:30:30.988323 | finish at 2025-09-10 12:03:06 + [2025-09-10 07:32:41] iteration 9155/ 11920 | consumed samples: 9374720 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820960E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:42.318225 | finish at 2025-09-10 11:52:23 + [2025-09-10 07:32:46] iteration 9156/ 11920 | consumed samples: 9375744 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816296E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:48.368233 | finish at 2025-09-10 11:51:34 + [2025-09-10 07:32:52] iteration 9157/ 11920 | consumed samples: 9376768 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812179E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:01.623357 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:32:57] iteration 9158/ 11920 | consumed samples: 9377792 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812971E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:05.754964 | finish at 2025-09-10 11:52:03 + [2025-09-10 07:33:03] iteration 9159/ 11920 | consumed samples: 9378816 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811686E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:36.354285 | finish at 2025-09-10 11:51:39 + [2025-09-10 07:33:09] iteration 9160/ 11920 | consumed samples: 9379840 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812985E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:33.779182 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:33:14] iteration 9161/ 11920 | consumed samples: 9380864 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819387E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:41.326025 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:33:20] iteration 9162/ 11920 | consumed samples: 9381888 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837368E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:50.810354 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:33:25] iteration 9163/ 11920 | consumed samples: 9382912 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831268E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:22.230811 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:33:31] iteration 9164/ 11920 | consumed samples: 9383936 | elapsed time per iteration (ms): 5840.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819874E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:16.176515 | finish at 2025-09-10 12:01:48 + [2025-09-10 07:33:37] iteration 9165/ 11920 | consumed samples: 9384960 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829797E+00 | loss scale: 1.0 | grad norm: 0.244 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:50.073825 | finish at 2025-09-10 
11:52:27 + [2025-09-10 07:33:43] iteration 9166/ 11920 | consumed samples: 9385984 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822133E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:53.417928 | finish at 2025-09-10 11:51:36 + [2025-09-10 07:33:49] iteration 9167/ 11920 | consumed samples: 9387008 | elapsed time per iteration (ms): 5953.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834763E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:33:10.805391 | finish at 2025-09-10 12:06:59 + [2025-09-10 07:33:54] iteration 9168/ 11920 | consumed samples: 9388032 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839403E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:56.191833 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:34:00] iteration 9169/ 11920 | consumed samples: 9389056 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823101E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:54.150686 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:34:06] iteration 9170/ 11920 | consumed samples: 9390080 | elapsed time per iteration (ms): 5832.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818286E+00 | loss scale: 
1.0 | grad norm: 0.196 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:19.532959 | finish at 2025-09-10 12:01:25 + [2025-09-10 07:34:12] iteration 9171/ 11920 | consumed samples: 9391104 | elapsed time per iteration (ms): 6231.4 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840677E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:45:30.200562 | finish at 2025-09-10 12:19:42 + [2025-09-10 07:34:18] iteration 9172/ 11920 | consumed samples: 9392128 | elapsed time per iteration (ms): 5990.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828783E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:34:20.772923 | finish at 2025-09-10 12:08:39 + [2025-09-10 07:34:23] iteration 9173/ 11920 | consumed samples: 9393152 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824831E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:34.150247 | finish at 2025-09-10 11:51:58 + [2025-09-10 07:34:29] iteration 9174/ 11920 | consumed samples: 9394176 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819105E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:22.176473 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:34:35] iteration 9175/ 11920 | consumed samples: 9395200 | elapsed time per iteration 
(ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838212E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:22.731049 | finish at 2025-09-10 11:51:57 + [2025-09-10 07:34:40] iteration 9176/ 11920 | consumed samples: 9396224 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827227E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:12.412558 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:34:46] iteration 9177/ 11920 | consumed samples: 9397248 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824651E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:06.031843 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:34:52] iteration 9178/ 11920 | consumed samples: 9398272 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816052E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:56.069165 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:34:58] iteration 9179/ 11920 | consumed samples: 9399296 | elapsed time per iteration (ms): 5970.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813218E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:32:46.138382 | finish at 2025-09-10 12:07:44 + [2025-09-10 07:35:03] iteration 9180/ 11920 | consumed samples: 9400320 | elapsed time per iteration (ms): 5886.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818621E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:48.084650 | finish at 2025-09-10 12:03:52 + [2025-09-10 07:35:09] iteration 9181/ 11920 | consumed samples: 9401344 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817717E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:29.480280 | finish at 2025-09-10 11:51:39 + [2025-09-10 07:35:15] iteration 9182/ 11920 | consumed samples: 9402368 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832468E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:34.645070 | finish at 2025-09-10 11:51:49 + [2025-09-10 07:35:20] iteration 9183/ 11920 | consumed samples: 9403392 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817938E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:32.269579 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:35:26] iteration 9184/ 11920 | consumed samples: 9404416 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.819864E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:17.756081 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:35:32] iteration 9185/ 11920 | consumed samples: 9405440 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823373E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:13.028246 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:35:37] iteration 9186/ 11920 | consumed samples: 9406464 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818875E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:27.590205 | finish at 2025-09-10 11:52:05 + [2025-09-10 07:35:43] iteration 9187/ 11920 | consumed samples: 9407488 | elapsed time per iteration (ms): 5857.9 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823582E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:26:49.630013 | finish at 2025-09-10 12:02:33 + [2025-09-10 07:35:49] iteration 9188/ 11920 | consumed samples: 9408512 | elapsed time per iteration (ms): 5836.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818135E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:44.346630 | finish at 2025-09-10 12:01:33 + [2025-09-10 07:35:55] iteration 9189/ 11920 | 
consumed samples: 9409536 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823780E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:49.035542 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:36:00] iteration 9190/ 11920 | consumed samples: 9410560 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815098E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:52.240562 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:36:06] iteration 9191/ 11920 | consumed samples: 9411584 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815472E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:58.948042 | finish at 2025-09-10 11:52:05 + [2025-09-10 07:36:11] iteration 9192/ 11920 | consumed samples: 9412608 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810608E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:28.194162 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:36:17] iteration 9193/ 11920 | consumed samples: 9413632 | elapsed time per iteration (ms): 5876.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821390E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:04.725627 | finish at 2025-09-10 12:03:22 + [2025-09-10 07:36:23] iteration 9194/ 11920 | consumed samples: 9414656 | elapsed time per iteration (ms): 5942.4 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828099E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:59.074632 | finish at 2025-09-10 12:06:22 + [2025-09-10 07:36:29] iteration 9195/ 11920 | consumed samples: 9415680 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810627E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:13.023591 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:36:35] iteration 9196/ 11920 | consumed samples: 9416704 | elapsed time per iteration (ms): 5836.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811133E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:59.171348 | finish at 2025-09-10 12:01:34 + [2025-09-10 07:36:40] iteration 9197/ 11920 | consumed samples: 9417728 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814349E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:06.723893 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:36:46] iteration 9198/ 11920 | consumed samples: 9418752 | elapsed time per iteration (ms): 5822.0 | throughput per GPU (TFLOP/s/GPU): 
77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802894E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:07.421771 | finish at 2025-09-10 12:00:54 + [2025-09-10 07:36:52] iteration 9199/ 11920 | consumed samples: 9419776 | elapsed time per iteration (ms): 5850.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829750E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:25:19.940620 | finish at 2025-09-10 12:02:12 + [2025-09-10 07:36:58] iteration 9200/ 11920 | consumed samples: 9420800 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822184E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:10.162621 | finish at 2025-09-10 11:52:08 + [2025-09-10 07:37:03] iteration 9201/ 11920 | consumed samples: 9421824 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824858E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:47.694679 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:37:09] iteration 9202/ 11920 | consumed samples: 9422848 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827850E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:43.889837 | finish at 2025-09-10 
11:51:53 + [2025-09-10 07:37:14] iteration 9203/ 11920 | consumed samples: 9423872 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829330E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:49.258212 | finish at 2025-09-10 11:52:04 + [2025-09-10 07:37:20] iteration 9204/ 11920 | consumed samples: 9424896 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832686E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:01.207265 | finish at 2025-09-10 11:52:21 + [2025-09-10 07:37:26] iteration 9205/ 11920 | consumed samples: 9425920 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825409E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:41.658390 | finish at 2025-09-10 11:52:07 + [2025-09-10 07:37:31] iteration 9206/ 11920 | consumed samples: 9426944 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825268E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:20.067912 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:37:37] iteration 9207/ 11920 | consumed samples: 9427968 | elapsed time per iteration (ms): 5850.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816517E+00 | loss scale: 
1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:33.296827 | finish at 2025-09-10 12:02:10 + [2025-09-10 07:37:43] iteration 9208/ 11920 | consumed samples: 9428992 | elapsed time per iteration (ms): 6172.9 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818342E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:39:00.968708 | finish at 2025-09-10 12:16:44 + [2025-09-10 07:37:49] iteration 9209/ 11920 | consumed samples: 9430016 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826175E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:54.876661 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:37:55] iteration 9210/ 11920 | consumed samples: 9431040 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838895E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:59.643297 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:38:00] iteration 9211/ 11920 | consumed samples: 9432064 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823100E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:54.209054 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:38:06] iteration 9212/ 11920 | consumed samples: 9433088 | elapsed time per iteration 
(ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833220E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:45.088081 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:38:11] iteration 9213/ 11920 | consumed samples: 9434112 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804803E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:44.678061 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:38:17] iteration 9214/ 11920 | consumed samples: 9435136 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822886E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:37.202907 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:38:23] iteration 9215/ 11920 | consumed samples: 9436160 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811107E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:31.504592 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:38:28] iteration 9216/ 11920 | consumed samples: 9437184 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811509E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:13:13.821014 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:38:34] iteration 9217/ 11920 | consumed samples: 9438208 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804685E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:26.856115 | finish at 2025-09-10 11:52:01 + [2025-09-10 07:38:40] iteration 9218/ 11920 | consumed samples: 9439232 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812166E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:29.946962 | finish at 2025-09-10 11:52:10 + [2025-09-10 07:38:45] iteration 9219/ 11920 | consumed samples: 9440256 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818447E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:10.366883 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:38:51] iteration 9220/ 11920 | consumed samples: 9441280 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839113E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:53.654652 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:38:56] iteration 9221/ 11920 | consumed samples: 9442304 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.802909E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:16.374799 | finish at 2025-09-10 11:52:13 + [2025-09-10 07:39:02] iteration 9222/ 11920 | consumed samples: 9443328 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814795E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:53.067183 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:39:08] iteration 9223/ 11920 | consumed samples: 9444352 | elapsed time per iteration (ms): 6005.0 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814519E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:29:55.406861 | finish at 2025-09-10 12:09:04 + [2025-09-10 07:39:14] iteration 9224/ 11920 | consumed samples: 9445376 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831432E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:30.477108 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:39:19] iteration 9225/ 11920 | consumed samples: 9446400 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811019E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:49.421725 | finish at 2025-09-10 11:52:09 + [2025-09-10 07:39:25] iteration 9226/ 11920 | 
consumed samples: 9447424 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809474E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:15.533740 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:39:31] iteration 9227/ 11920 | consumed samples: 9448448 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830139E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:10.088858 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:39:36] iteration 9228/ 11920 | consumed samples: 9449472 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813176E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:58.836799 | finish at 2025-09-10 11:52:35 + [2025-09-10 07:39:42] iteration 9229/ 11920 | consumed samples: 9450496 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827771E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:08.348343 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:39:47] iteration 9230/ 11920 | consumed samples: 9451520 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817606E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:55.542796 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:39:53] iteration 9231/ 11920 | consumed samples: 9452544 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824454E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:52.101477 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:39:59] iteration 9232/ 11920 | consumed samples: 9453568 | elapsed time per iteration (ms): 5962.0 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808893E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:27:05.734589 | finish at 2025-09-10 12:07:05 + [2025-09-10 07:40:05] iteration 9233/ 11920 | consumed samples: 9454592 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822212E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:47.034653 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:40:10] iteration 9234/ 11920 | consumed samples: 9455616 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823773E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:37.917764 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:40:16] iteration 9235/ 11920 | consumed samples: 9456640 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829708E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:32.422907 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:40:22] iteration 9236/ 11920 | consumed samples: 9457664 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828989E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:44.933898 | finish at 2025-09-10 11:52:06 + [2025-09-10 07:40:27] iteration 9237/ 11920 | consumed samples: 9458688 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826754E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:23.344267 | finish at 2025-09-10 11:51:51 + [2025-09-10 07:40:33] iteration 9238/ 11920 | consumed samples: 9459712 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825275E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:24.577229 | finish at 2025-09-10 11:51:57 + [2025-09-10 07:40:39] iteration 9239/ 11920 | consumed samples: 9460736 | elapsed time per iteration (ms): 6002.0 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814784E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:28:11.326455 | finish at 2025-09-10 
12:08:50 + [2025-09-10 07:40:44] iteration 9240/ 11920 | consumed samples: 9461760 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827933E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:53.878479 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:40:50] iteration 9241/ 11920 | consumed samples: 9462784 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822611E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:53.780569 | finish at 2025-09-10 11:51:44 + [2025-09-10 07:40:56] iteration 9242/ 11920 | consumed samples: 9463808 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827229E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:33.846904 | finish at 2025-09-10 11:52:30 + [2025-09-10 07:41:01] iteration 9243/ 11920 | consumed samples: 9464832 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834223E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:45.945980 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:41:07] iteration 9244/ 11920 | consumed samples: 9465856 | elapsed time per iteration (ms): 5635.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823232E+00 | loss scale: 
1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:20.790556 | finish at 2025-09-10 11:52:28 + [2025-09-10 07:41:13] iteration 9245/ 11920 | consumed samples: 9466880 | elapsed time per iteration (ms): 5836.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832222E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:12.817800 | finish at 2025-09-10 12:01:26 + [2025-09-10 07:41:18] iteration 9246/ 11920 | consumed samples: 9467904 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821750E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:08.377594 | finish at 2025-09-10 11:52:27 + [2025-09-10 07:41:24] iteration 9247/ 11920 | consumed samples: 9468928 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838361E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:29.511450 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:41:30] iteration 9248/ 11920 | consumed samples: 9469952 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824420E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:26.777142 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:41:36] iteration 9249/ 11920 | consumed samples: 9470976 | elapsed time per iteration 
(ms): 5877.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827247E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:39.452205 | finish at 2025-09-10 12:03:15 + [2025-09-10 07:41:41] iteration 9250/ 11920 | consumed samples: 9472000 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825478E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:14.587419 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:41:47] iteration 9251/ 11920 | consumed samples: 9473024 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819530E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:07.637211 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:41:52] iteration 9252/ 11920 | consumed samples: 9474048 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824748E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:50.371715 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:41:58] iteration 9253/ 11920 | consumed samples: 9475072 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838919E+00 | loss scale: 1.0 | grad norm: 0.281 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:10:10.728747 | finish at 2025-09-10 11:52:09 + [2025-09-10 07:42:04] iteration 9254/ 11920 | consumed samples: 9476096 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813783E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:44.339679 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:42:09] iteration 9255/ 11920 | consumed samples: 9477120 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825558E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:35.542219 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:42:15] iteration 9256/ 11920 | consumed samples: 9478144 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820335E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:32.408844 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:42:20] iteration 9257/ 11920 | consumed samples: 9479168 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821533E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:45.117114 | finish at 2025-09-10 11:52:06 + [2025-09-10 07:42:26] iteration 9258/ 11920 | consumed samples: 9480192 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.828717E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:30.384344 | finish at 2025-09-10 11:51:57 + [2025-09-10 07:42:32] iteration 9259/ 11920 | consumed samples: 9481216 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811725E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:09:21.291535 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:42:38] iteration 9260/ 11920 | consumed samples: 9482240 | elapsed time per iteration (ms): 5939.0 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818172E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:23:17.643948 | finish at 2025-09-10 12:05:55 + [2025-09-10 07:42:44] iteration 9261/ 11920 | consumed samples: 9483264 | elapsed time per iteration (ms): 5833.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822095E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:31.660433 | finish at 2025-09-10 12:01:15 + [2025-09-10 07:42:49] iteration 9262/ 11920 | consumed samples: 9484288 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822482E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:52.509733 | finish at 2025-09-10 11:51:42 + [2025-09-10 07:42:55] iteration 9263/ 11920 | 
consumed samples: 9485312 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819564E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:52.793900 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:43:01] iteration 9264/ 11920 | consumed samples: 9486336 | elapsed time per iteration (ms): 6142.8 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.43% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823603E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:31:55.279579 | finish at 2025-09-10 12:14:56 + [2025-09-10 07:43:07] iteration 9265/ 11920 | consumed samples: 9487360 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820265E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:38.831019 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:43:13] iteration 9266/ 11920 | consumed samples: 9488384 | elapsed time per iteration (ms): 5990.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812571E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:24:57.897264 | finish at 2025-09-10 12:08:10 + [2025-09-10 07:43:18] iteration 9267/ 11920 | consumed samples: 9489408 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817867E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:40.149605 | finish at 2025-09-10 11:51:58 + [2025-09-10 07:43:24] iteration 9268/ 11920 | consumed samples: 9490432 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817972E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:46.742126 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:43:30] iteration 9269/ 11920 | consumed samples: 9491456 | elapsed time per iteration (ms): 5889.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835197E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:12.230571 | finish at 2025-09-10 12:03:42 + [2025-09-10 07:43:35] iteration 9270/ 11920 | consumed samples: 9492480 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818249E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:28.004534 | finish at 2025-09-10 11:52:03 + [2025-09-10 07:43:41] iteration 9271/ 11920 | consumed samples: 9493504 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818574E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:11.796272 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:43:47] iteration 9272/ 11920 | consumed samples: 9494528 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 
80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817942E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:51.213289 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:43:52] iteration 9273/ 11920 | consumed samples: 9495552 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815034E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:55.060526 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:43:58] iteration 9274/ 11920 | consumed samples: 9496576 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824549E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:15.200659 | finish at 2025-09-10 11:52:13 + [2025-09-10 07:44:03] iteration 9275/ 11920 | consumed samples: 9497600 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819613E+00 | loss scale: 1.0 | grad norm: 0.308 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:05.786998 | finish at 2025-09-10 11:52:09 + [2025-09-10 07:44:09] iteration 9276/ 11920 | consumed samples: 9498624 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834844E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:00.892232 | finish at 2025-09-10 
11:52:10 + [2025-09-10 07:44:15] iteration 9277/ 11920 | consumed samples: 9499648 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822284E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:05.800634 | finish at 2025-09-10 11:52:20 + [2025-09-10 07:44:20] iteration 9278/ 11920 | consumed samples: 9500672 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806257E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:45.055867 | finish at 2025-09-10 11:52:05 + [2025-09-10 07:44:26] iteration 9279/ 11920 | consumed samples: 9501696 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828249E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:17.348388 | finish at 2025-09-10 11:51:43 + [2025-09-10 07:44:32] iteration 9280/ 11920 | consumed samples: 9502720 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816941E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:35.880718 | finish at 2025-09-10 11:52:07 + [2025-09-10 07:44:37] iteration 9281/ 11920 | consumed samples: 9503744 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835965E+00 | loss scale: 
1.0 | grad norm: 0.250 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:31.350163 | finish at 2025-09-10 11:52:09 + [2025-09-10 07:44:43] iteration 9282/ 11920 | consumed samples: 9504768 | elapsed time per iteration (ms): 5954.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826424E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:46.589095 | finish at 2025-09-10 12:06:30 + [2025-09-10 07:44:49] iteration 9283/ 11920 | consumed samples: 9505792 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805071E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:29.123149 | finish at 2025-09-10 11:52:18 + [2025-09-10 07:44:54] iteration 9284/ 11920 | consumed samples: 9506816 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820175E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:19.773417 | finish at 2025-09-10 11:52:14 + [2025-09-10 07:45:00] iteration 9285/ 11920 | consumed samples: 9507840 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825968E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:33.960114 | finish at 2025-09-10 11:52:34 + [2025-09-10 07:45:06] iteration 9286/ 11920 | consumed samples: 9508864 | elapsed time per iteration 
(ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812987E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:45.976637 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:45:11] iteration 9287/ 11920 | consumed samples: 9509888 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832721E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:35.729603 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:45:17] iteration 9288/ 11920 | consumed samples: 9510912 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812522E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:29.576242 | finish at 2025-09-10 11:51:46 + [2025-09-10 07:45:22] iteration 9289/ 11920 | consumed samples: 9511936 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822093E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:37.093585 | finish at 2025-09-10 11:52:00 + [2025-09-10 07:45:28] iteration 9290/ 11920 | consumed samples: 9512960 | elapsed time per iteration (ms): 5977.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815744E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:21:59.617846 | finish at 2025-09-10 12:07:28 + [2025-09-10 07:45:34] iteration 9291/ 11920 | consumed samples: 9513984 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810470E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:31.744784 | finish at 2025-09-10 11:52:06 + [2025-09-10 07:45:40] iteration 9292/ 11920 | consumed samples: 9515008 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808993E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:31.374653 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:45:46] iteration 9293/ 11920 | consumed samples: 9516032 | elapsed time per iteration (ms): 5891.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819683E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:56.518909 | finish at 2025-09-10 12:03:42 + [2025-09-10 07:45:52] iteration 9294/ 11920 | consumed samples: 9517056 | elapsed time per iteration (ms): 5947.2 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818523E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:17.398070 | finish at 2025-09-10 12:06:09 + [2025-09-10 07:45:58] iteration 9295/ 11920 | consumed samples: 9518080 | elapsed time per iteration (ms): 5943.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.821991E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:01.566195 | finish at 2025-09-10 12:05:59 + [2025-09-10 07:46:03] iteration 9296/ 11920 | consumed samples: 9519104 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827600E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:49.834457 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:46:09] iteration 9297/ 11920 | consumed samples: 9520128 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820838E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:40.871969 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:46:14] iteration 9298/ 11920 | consumed samples: 9521152 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826940E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:32.629682 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:46:20] iteration 9299/ 11920 | consumed samples: 9522176 | elapsed time per iteration (ms): 5849.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826798E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:31.826172 | finish at 2025-09-10 12:01:52 + [2025-09-10 07:46:26] iteration 9300/ 11920 | 
consumed samples: 9523200 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812367E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:54.358859 | finish at 2025-09-10 11:52:20 + [2025-09-10 07:46:31] iteration 9301/ 11920 | consumed samples: 9524224 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815425E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:23.954879 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:46:37] iteration 9302/ 11920 | consumed samples: 9525248 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816208E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:29.588738 | finish at 2025-09-10 11:52:07 + [2025-09-10 07:46:43] iteration 9303/ 11920 | consumed samples: 9526272 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801594E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:16.809598 | finish at 2025-09-10 11:52:00 + [2025-09-10 07:46:48] iteration 9304/ 11920 | consumed samples: 9527296 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803215E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 5.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:56.287663 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:46:54] iteration 9305/ 11920 | consumed samples: 9528320 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816333E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:53.525907 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:47:00] iteration 9306/ 11920 | consumed samples: 9529344 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808648E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:08.049014 | finish at 2025-09-10 11:52:08 + [2025-09-10 07:47:05] iteration 9307/ 11920 | consumed samples: 9530368 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815758E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:35.487494 | finish at 2025-09-10 11:51:41 + [2025-09-10 07:47:11] iteration 9308/ 11920 | consumed samples: 9531392 | elapsed time per iteration (ms): 5838.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804465E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:08.982573 | finish at 2025-09-10 12:01:20 + [2025-09-10 07:47:17] iteration 9309/ 11920 | consumed samples: 9532416 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812793E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:36.678894 | finish at 2025-09-10 11:51:53 + [2025-09-10 07:47:22] iteration 9310/ 11920 | consumed samples: 9533440 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819081E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:26.088331 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:47:28] iteration 9311/ 11920 | consumed samples: 9534464 | elapsed time per iteration (ms): 5849.7 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824387E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:21.986041 | finish at 2025-09-10 12:01:50 + [2025-09-10 07:47:34] iteration 9312/ 11920 | consumed samples: 9535488 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799325E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:41.873810 | finish at 2025-09-10 11:52:16 + [2025-09-10 07:47:40] iteration 9313/ 11920 | consumed samples: 9536512 | elapsed time per iteration (ms): 5856.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814187E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:26.612994 | finish at 2025-09-10 
12:02:06 + [2025-09-10 07:47:45] iteration 9314/ 11920 | consumed samples: 9537536 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826817E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:03.121346 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:47:51] iteration 9315/ 11920 | consumed samples: 9538560 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825048E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:49.492891 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:47:56] iteration 9316/ 11920 | consumed samples: 9539584 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822603E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:08.073037 | finish at 2025-09-10 11:52:05 + [2025-09-10 07:48:02] iteration 9317/ 11920 | consumed samples: 9540608 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814492E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:58.899206 | finish at 2025-09-10 11:52:01 + [2025-09-10 07:48:08] iteration 9318/ 11920 | consumed samples: 9541632 | elapsed time per iteration (ms): 5947.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814982E+00 | loss scale: 
1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:55.555029 | finish at 2025-09-10 12:06:04 + [2025-09-10 07:48:14] iteration 9319/ 11920 | consumed samples: 9542656 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821207E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:42.936669 | finish at 2025-09-10 11:51:57 + [2025-09-10 07:48:19] iteration 9320/ 11920 | consumed samples: 9543680 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816195E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:38.602133 | finish at 2025-09-10 11:51:58 + [2025-09-10 07:48:25] iteration 9321/ 11920 | consumed samples: 9544704 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813193E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:13.453186 | finish at 2025-09-10 11:51:38 + [2025-09-10 07:48:31] iteration 9322/ 11920 | consumed samples: 9545728 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827148E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:15.998906 | finish at 2025-09-10 11:51:47 + [2025-09-10 07:48:36] iteration 9323/ 11920 | consumed samples: 9546752 | elapsed time per iteration 
(ms): 5834.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821348E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:33.036911 | finish at 2025-09-10 12:01:09 + [2025-09-10 07:48:42] iteration 9324/ 11920 | consumed samples: 9547776 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818342E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:22.405922 | finish at 2025-09-10 11:52:04 + [2025-09-10 07:48:48] iteration 9325/ 11920 | consumed samples: 9548800 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808823E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:20.685550 | finish at 2025-09-10 11:52:08 + [2025-09-10 07:48:53] iteration 9326/ 11920 | consumed samples: 9549824 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813727E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:58.794878 | finish at 2025-09-10 11:51:52 + [2025-09-10 07:48:59] iteration 9327/ 11920 | consumed samples: 9550848 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828134E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 4:02:55.545551 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:49:04] iteration 9328/ 11920 | consumed samples: 9551872 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820121E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:44.956490 | finish at 2025-09-10 11:51:49 + [2025-09-10 07:49:10] iteration 9329/ 11920 | consumed samples: 9552896 | elapsed time per iteration (ms): 5898.8 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813969E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:43.920257 | finish at 2025-09-10 12:03:54 + [2025-09-10 07:49:16] iteration 9330/ 11920 | consumed samples: 9553920 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817215E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:46.761637 | finish at 2025-09-10 11:52:03 + [2025-09-10 07:49:22] iteration 9331/ 11920 | consumed samples: 9554944 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817955E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:26.028592 | finish at 2025-09-10 11:51:48 + [2025-09-10 07:49:28] iteration 9332/ 11920 | consumed samples: 9555968 | elapsed time per iteration (ms): 5958.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.806618E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:17:01.277110 | finish at 2025-09-10 12:06:29 + [2025-09-10 07:49:33] iteration 9333/ 11920 | consumed samples: 9556992 | elapsed time per iteration (ms): 5885.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819652E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:46.617963 | finish at 2025-09-10 12:03:20 + [2025-09-10 07:49:39] iteration 9334/ 11920 | consumed samples: 9558016 | elapsed time per iteration (ms): 5987.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804524E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:18:04.706366 | finish at 2025-09-10 12:07:44 + [2025-09-10 07:49:45] iteration 9335/ 11920 | consumed samples: 9559040 | elapsed time per iteration (ms): 6028.7 | throughput per GPU (TFLOP/s/GPU): 74.9 | MFU 7.57% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829106E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:19:44.180548 | finish at 2025-09-10 12:09:30 + [2025-09-10 07:49:51] iteration 9336/ 11920 | consumed samples: 9560064 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811372E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:48.685648 | finish at 2025-09-10 11:51:40 + [2025-09-10 07:49:57] iteration 9337/ 11920 | 
consumed samples: 9561088 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809838E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:53.789443 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:50:02] iteration 9338/ 11920 | consumed samples: 9562112 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815840E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:42.833869 | finish at 2025-09-10 11:51:45 + [2025-09-10 07:50:08] iteration 9339/ 11920 | consumed samples: 9563136 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816368E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:56.816748 | finish at 2025-09-10 11:52:05 + [2025-09-10 07:50:14] iteration 9340/ 11920 | consumed samples: 9564160 | elapsed time per iteration (ms): 5940.3 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808160E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:25.985656 | finish at 2025-09-10 12:05:40 + [2025-09-10 07:50:20] iteration 9341/ 11920 | consumed samples: 9565184 | elapsed time per iteration (ms): 5974.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810575E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:48.438258 | finish at 2025-09-10 12:07:08 + [2025-09-10 07:50:26] iteration 9342/ 11920 | consumed samples: 9566208 | elapsed time per iteration (ms): 5927.1 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815177E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:40.095615 | finish at 2025-09-10 12:05:06 + [2025-09-10 07:50:32] iteration 9343/ 11920 | consumed samples: 9567232 | elapsed time per iteration (ms): 5850.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823000E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:15.930219 | finish at 2025-09-10 12:01:48 + [2025-09-10 07:50:37] iteration 9344/ 11920 | consumed samples: 9568256 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814829E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:57.951321 | finish at 2025-09-10 11:52:35 + [2025-09-10 07:50:43] iteration 9345/ 11920 | consumed samples: 9569280 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832157E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:16.520407 | finish at 2025-09-10 11:51:59 + [2025-09-10 07:50:49] iteration 9346/ 11920 | consumed samples: 9570304 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 
80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824898E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:06.987415 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:50:54] iteration 9347/ 11920 | consumed samples: 9571328 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826425E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 13.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:16.509409 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:51:00] iteration 9348/ 11920 | consumed samples: 9572352 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824058E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:54.263806 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:51:05] iteration 9349/ 11920 | consumed samples: 9573376 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825161E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:48.916726 | finish at 2025-09-10 11:51:54 + [2025-09-10 07:51:11] iteration 9350/ 11920 | consumed samples: 9574400 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813144E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:49.755001 | finish at 2025-09-10 
11:52:01 + [2025-09-10 07:51:17] iteration 9351/ 11920 | consumed samples: 9575424 | elapsed time per iteration (ms): 5950.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814295E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:46.426606 | finish at 2025-09-10 12:06:03 + [2025-09-10 07:51:23] iteration 9352/ 11920 | consumed samples: 9576448 | elapsed time per iteration (ms): 5983.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826472E+00 | loss scale: 1.0 | grad norm: 0.279 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:16:05.572906 | finish at 2025-09-10 12:07:29 + [2025-09-10 07:51:29] iteration 9353/ 11920 | consumed samples: 9577472 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804229E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:44.428453 | finish at 2025-09-10 11:52:13 + [2025-09-10 07:51:34] iteration 9354/ 11920 | consumed samples: 9578496 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820724E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:05.721731 | finish at 2025-09-10 11:52:40 + [2025-09-10 07:51:40] iteration 9355/ 11920 | consumed samples: 9579520 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820743E+00 | loss scale: 
1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:26.936771 | finish at 2025-09-10 11:52:07 + [2025-09-10 07:51:45] iteration 9356/ 11920 | consumed samples: 9580544 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832081E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:31.327248 | finish at 2025-09-10 11:52:17 + [2025-09-10 07:51:51] iteration 9357/ 11920 | consumed samples: 9581568 | elapsed time per iteration (ms): 5976.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822387E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:15:17.508662 | finish at 2025-09-10 12:07:09 + [2025-09-10 07:51:57] iteration 9358/ 11920 | consumed samples: 9582592 | elapsed time per iteration (ms): 5612.0 | throughput per GPU (TFLOP/s/GPU): 80.5 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825494E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:37.943360 | finish at 2025-09-10 11:51:35 + [2025-09-10 07:52:03] iteration 9359/ 11920 | consumed samples: 9583616 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820073E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:52.623097 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:52:08] iteration 9360/ 11920 | consumed samples: 9584640 | elapsed time per iteration 
(ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823872E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:01.284790 | finish at 2025-09-10 11:52:10 + [2025-09-10 07:52:14] iteration 9361/ 11920 | consumed samples: 9585664 | elapsed time per iteration (ms): 5816.1 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810647E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:03.458169 | finish at 2025-09-10 12:00:18 + [2025-09-10 07:52:20] iteration 9362/ 11920 | consumed samples: 9586688 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824688E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:35.344954 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:52:25] iteration 9363/ 11920 | consumed samples: 9587712 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826158E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:45.486121 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:52:31] iteration 9364/ 11920 | consumed samples: 9588736 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811938E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 3:59:24.167593 | finish at 2025-09-10 11:51:55 + [2025-09-10 07:52:37] iteration 9365/ 11920 | consumed samples: 9589760 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822591E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:52.050362 | finish at 2025-09-10 11:52:29 + [2025-09-10 07:52:42] iteration 9366/ 11920 | consumed samples: 9590784 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804965E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:24.266135 | finish at 2025-09-10 11:52:07 + [2025-09-10 07:52:48] iteration 9367/ 11920 | consumed samples: 9591808 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820181E+00 | loss scale: 1.0 | grad norm: 0.283 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:15.423198 | finish at 2025-09-10 11:52:03 + [2025-09-10 07:52:53] iteration 9368/ 11920 | consumed samples: 9592832 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821859E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:56.750933 | finish at 2025-09-10 11:51:50 + [2025-09-10 07:52:59] iteration 9369/ 11920 | consumed samples: 9593856 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.822250E+00 | loss scale: 1.0 | grad norm: 0.496 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:56.527869 | finish at 2025-09-10 11:51:56 + [2025-09-10 07:53:05] iteration 9370/ 11920 | consumed samples: 9594880 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828110E+00 | loss scale: 1.0 | grad norm: 0.331 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:54.396422 | finish at 2025-09-10 11:51:59 + [2025-09-10 07:53:10] iteration 9371/ 11920 | consumed samples: 9595904 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830951E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:27.290520 | finish at 2025-09-10 11:52:38 + [2025-09-10 07:53:16] iteration 9372/ 11920 | consumed samples: 9596928 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834052E+00 | loss scale: 1.0 | grad norm: 0.325 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:44.765430 | finish at 2025-09-10 11:52:01 + [2025-09-10 07:53:22] iteration 9373/ 11920 | consumed samples: 9597952 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836469E+00 | loss scale: 1.0 | grad norm: 0.527 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:51.096009 | finish at 2025-09-10 11:52:13 + [2025-09-10 07:53:27] iteration 9374/ 11920 | 
consumed samples: 9598976 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849575E+00 | loss scale: 1.0 | grad norm: 1.074 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:54.485327 | finish at 2025-09-10 11:52:22 + [2025-09-10 07:53:33] iteration 9375/ 11920 | consumed samples: 9600000 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848547E+00 | loss scale: 1.0 | grad norm: 0.477 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:46.852163 | finish at 2025-09-10 11:52:20 + [2025-09-10 07:53:38] iteration 9376/ 11920 | consumed samples: 9601024 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861421E+00 | loss scale: 1.0 | grad norm: 0.584 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:46.472935 | finish at 2025-09-10 11:52:25 + [2025-09-10 07:53:44] iteration 9377/ 11920 | consumed samples: 9602048 | elapsed time per iteration (ms): 5839.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850379E+00 | loss scale: 1.0 | grad norm: 0.767 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:28.775832 | finish at 2025-09-10 12:01:13 + [2025-09-10 07:53:50] iteration 9378/ 11920 | consumed samples: 9603072 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882029E+00 | loss scale: 1.0 | grad norm: 2.149 | num zeros: 2.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:02.363292 | finish at 2025-09-10 11:52:52 + [2025-09-10 07:53:56] iteration 9379/ 11920 | consumed samples: 9604096 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875096E+00 | loss scale: 1.0 | grad norm: 0.635 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:15.284148 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:54:01] iteration 9380/ 11920 | consumed samples: 9605120 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858435E+00 | loss scale: 1.0 | grad norm: 1.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:16.282773 | finish at 2025-09-10 11:52:18 + [2025-09-10 07:54:07] iteration 9381/ 11920 | consumed samples: 9606144 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884120E+00 | loss scale: 1.0 | grad norm: 2.041 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:30.552603 | finish at 2025-09-10 11:52:37 + [2025-09-10 07:54:12] iteration 9382/ 11920 | consumed samples: 9607168 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877573E+00 | loss scale: 1.0 | grad norm: 0.436 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:04.670055 | finish at 2025-09-10 11:52:17 + [2025-09-10 07:54:18] iteration 9383/ 11920 | consumed samples: 9608192 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 
80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.881515E+00 | loss scale: 1.0 | grad norm: 0.498 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:53.210810 | finish at 2025-09-10 11:52:11 + [2025-09-10 07:54:24] iteration 9384/ 11920 | consumed samples: 9609216 | elapsed time per iteration (ms): 5648.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878733E+00 | loss scale: 1.0 | grad norm: 0.877 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:44.283319 | finish at 2025-09-10 11:53:08 + [2025-09-10 07:54:29] iteration 9385/ 11920 | consumed samples: 9610240 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.883059E+00 | loss scale: 1.0 | grad norm: 1.586 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:07.812206 | finish at 2025-09-10 11:52:37 + [2025-09-10 07:54:35] iteration 9386/ 11920 | consumed samples: 9611264 | elapsed time per iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.929073E+00 | loss scale: 1.0 | grad norm: 3.647 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:42.278434 | finish at 2025-09-10 11:53:17 + [2025-09-10 07:54:41] iteration 9387/ 11920 | consumed samples: 9612288 | elapsed time per iteration (ms): 5759.4 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 6.639388E+00 | loss scale: 1.0 | grad norm: 306.783 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:08.649303 | finish at 
2025-09-10 11:57:49 + [2025-09-10 07:54:46] iteration 9388/ 11920 | consumed samples: 9613312 | elapsed time per iteration (ms): 5648.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035670E+00 | loss scale: 1.0 | grad norm: 3.607 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:22.493305 | finish at 2025-09-10 11:53:09 + [2025-09-10 07:54:52] iteration 9389/ 11920 | consumed samples: 9614336 | elapsed time per iteration (ms): 5737.8 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.993705E+00 | loss scale: 1.0 | grad norm: 64.126 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:02.314177 | finish at 2025-09-10 11:56:55 + [2025-09-10 07:54:58] iteration 9390/ 11920 | consumed samples: 9615360 | elapsed time per iteration (ms): 5675.8 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.129265E+00 | loss scale: 1.0 | grad norm: 1.385 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:19.832122 | finish at 2025-09-10 11:54:18 + [2025-09-10 07:55:04] iteration 9391/ 11920 | consumed samples: 9616384 | elapsed time per iteration (ms): 5656.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.024839E+00 | loss scale: 1.0 | grad norm: 1.010 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:26.398203 | finish at 2025-09-10 11:53:30 + [2025-09-10 07:55:09] iteration 9392/ 11920 | consumed samples: 9617408 | elapsed time per iteration (ms): 5674.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.209396E+00 | 
loss scale: 1.0 | grad norm: 19.064 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:03.858200 | finish at 2025-09-10 11:54:13 + [2025-09-10 07:55:15] iteration 9393/ 11920 | consumed samples: 9618432 | elapsed time per iteration (ms): 5675.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.084102E+00 | loss scale: 1.0 | grad norm: 1.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:59:02.882976 | finish at 2025-09-10 11:54:18 + [2025-09-10 07:55:21] iteration 9394/ 11920 | consumed samples: 9619456 | elapsed time per iteration (ms): 5751.2 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.440982E+00 | loss scale: 1.0 | grad norm: 41.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:07.451604 | finish at 2025-09-10 11:57:28 + [2025-09-10 07:55:26] iteration 9395/ 11920 | consumed samples: 9620480 | elapsed time per iteration (ms): 5744.9 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 7.381576E+00 | loss scale: 1.0 | grad norm: 23.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:45.889928 | finish at 2025-09-10 11:57:12 + [2025-09-10 07:55:32] iteration 9396/ 11920 | consumed samples: 9621504 | elapsed time per iteration (ms): 5719.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.755512E+00 | loss scale: 1.0 | grad norm: 8.771 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:35.441068 | finish at 2025-09-10 11:56:08 + [2025-09-10 07:55:38] iteration 9397/ 11920 | consumed samples: 9622528 | elapsed 
time per iteration (ms): 5757.5 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 5.937361E+00 | loss scale: 1.0 | grad norm: 13.872 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:06.105562 | finish at 2025-09-10 11:57:44 + [2025-09-10 07:55:44] iteration 9398/ 11920 | consumed samples: 9623552 | elapsed time per iteration (ms): 6054.7 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.913692E+00 | loss scale: 1.0 | grad norm: 4.963 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:30.064982 | finish at 2025-09-10 12:10:14 + [2025-09-10 07:55:50] iteration 9399/ 11920 | consumed samples: 9624576 | elapsed time per iteration (ms): 5748.0 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.293681E+00 | loss scale: 1.0 | grad norm: 1.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:30.733610 | finish at 2025-09-10 11:57:20 + [2025-09-10 07:55:55] iteration 9400/ 11920 | consumed samples: 9625600 | elapsed time per iteration (ms): 5726.8 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.876777E+00 | loss scale: 1.0 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:31.534109 | finish at 2025-09-10 11:56:27 + [2025-09-10 07:56:01] iteration 9401/ 11920 | consumed samples: 9626624 | elapsed time per iteration (ms): 5733.3 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.977325E+00 | loss scale: 1.0 | grad norm: 1.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 4:00:42.161005 | finish at 2025-09-10 11:56:43 + [2025-09-10 07:56:07] iteration 9402/ 11920 | consumed samples: 9627648 | elapsed time per iteration (ms): 5742.2 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.796427E+00 | loss scale: 1.0 | grad norm: 1.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:58.957198 | finish at 2025-09-10 11:57:06 + [2025-09-10 07:56:13] iteration 9403/ 11920 | consumed samples: 9628672 | elapsed time per iteration (ms): 5979.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.919903E+00 | loss scale: 1.0 | grad norm: 2.430 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:51.223176 | finish at 2025-09-10 12:07:04 + [2025-09-10 07:56:19] iteration 9404/ 11920 | consumed samples: 9629696 | elapsed time per iteration (ms): 5991.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.049092E+00 | loss scale: 1.0 | grad norm: 2.641 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:13.998293 | finish at 2025-09-10 12:07:33 + [2025-09-10 07:56:25] iteration 9405/ 11920 | consumed samples: 9630720 | elapsed time per iteration (ms): 5728.5 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.751800E+00 | loss scale: 1.0 | grad norm: 1.487 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:07.102869 | finish at 2025-09-10 11:56:32 + [2025-09-10 07:56:31] iteration 9406/ 11920 | consumed samples: 9631744 | elapsed time per iteration (ms): 6039.9 | throughput per GPU (TFLOP/s/GPU): 74.8 | MFU 7.56% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 4.565975E+00 | loss scale: 1.0 | grad norm: 5.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:13:04.228445 | finish at 2025-09-10 12:09:35 + [2025-09-10 07:56:37] iteration 9407/ 11920 | consumed samples: 9632768 | elapsed time per iteration (ms): 6084.4 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.895249E+00 | loss scale: 1.0 | grad norm: 1.142 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:14:50.104235 | finish at 2025-09-10 12:11:27 + [2025-09-10 07:56:42] iteration 9408/ 11920 | consumed samples: 9633792 | elapsed time per iteration (ms): 5747.7 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.94% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.013845E+00 | loss scale: 1.0 | grad norm: 1.836 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:38.177422 | finish at 2025-09-10 11:57:21 + [2025-09-10 07:56:48] iteration 9409/ 11920 | consumed samples: 9634816 | elapsed time per iteration (ms): 5704.8 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.863399E+00 | loss scale: 1.0 | grad norm: 1.111 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:44.758512 | finish at 2025-09-10 11:55:33 + [2025-09-10 07:56:54] iteration 9410/ 11920 | consumed samples: 9635840 | elapsed time per iteration (ms): 6014.6 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.036681E+00 | loss scale: 1.0 | grad norm: 3.574 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:11:36.622758 | finish at 2025-09-10 12:08:31 + [2025-09-10 07:57:00] 
iteration 9411/ 11920 | consumed samples: 9636864 | elapsed time per iteration (ms): 5701.3 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.920898E+00 | loss scale: 1.0 | grad norm: 1.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:24.545312 | finish at 2025-09-10 11:55:24 + [2025-09-10 07:57:06] iteration 9412/ 11920 | consumed samples: 9637888 | elapsed time per iteration (ms): 5685.1 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.741693E+00 | loss scale: 1.0 | grad norm: 0.750 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:38.116190 | finish at 2025-09-10 11:54:44 + [2025-09-10 07:57:11] iteration 9413/ 11920 | consumed samples: 9638912 | elapsed time per iteration (ms): 5756.2 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.039291E+00 | loss scale: 1.0 | grad norm: 6.460 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:30.735610 | finish at 2025-09-10 11:57:42 + [2025-09-10 07:57:17] iteration 9414/ 11920 | consumed samples: 9639936 | elapsed time per iteration (ms): 5720.8 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.040052E+00 | loss scale: 1.0 | grad norm: 2.274 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:56.245740 | finish at 2025-09-10 11:56:13 + [2025-09-10 07:57:23] iteration 9415/ 11920 | consumed samples: 9640960 | elapsed time per iteration (ms): 5721.2 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.838370E+00 | loss scale: 1.0 | grad norm: 1.307 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:51.677642 | finish at 2025-09-10 11:56:14 + [2025-09-10 07:57:28] iteration 9416/ 11920 | consumed samples: 9641984 | elapsed time per iteration (ms): 5703.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.786053E+00 | loss scale: 1.0 | grad norm: 2.124 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:01.869745 | finish at 2025-09-10 11:55:30 + [2025-09-10 07:57:34] iteration 9417/ 11920 | consumed samples: 9643008 | elapsed time per iteration (ms): 5724.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.830330E+00 | loss scale: 1.0 | grad norm: 1.781 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:49.519603 | finish at 2025-09-10 11:56:24 + [2025-09-10 07:57:40] iteration 9418/ 11920 | consumed samples: 9644032 | elapsed time per iteration (ms): 5688.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.737825E+00 | loss scale: 1.0 | grad norm: 0.891 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:12.645907 | finish at 2025-09-10 11:54:53 + [2025-09-10 07:57:46] iteration 9419/ 11920 | consumed samples: 9645056 | elapsed time per iteration (ms): 5696.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.660286E+00 | loss scale: 1.0 | grad norm: 0.990 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:26.646726 | finish at 2025-09-10 11:55:12 + [2025-09-10 07:57:51] iteration 9420/ 11920 | consumed samples: 9646080 | elapsed time per iteration (ms): 5711.4 | throughput 
per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.613642E+00 | loss scale: 1.0 | grad norm: 1.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:58.492928 | finish at 2025-09-10 11:55:50 + [2025-09-10 07:57:57] iteration 9421/ 11920 | consumed samples: 9647104 | elapsed time per iteration (ms): 5691.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.583628E+00 | loss scale: 1.0 | grad norm: 1.152 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:03.614260 | finish at 2025-09-10 11:55:01 + [2025-09-10 07:58:03] iteration 9422/ 11920 | consumed samples: 9648128 | elapsed time per iteration (ms): 6020.5 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.554032E+00 | loss scale: 1.0 | grad norm: 0.909 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:10:39.096299 | finish at 2025-09-10 12:08:42 + [2025-09-10 07:58:09] iteration 9423/ 11920 | consumed samples: 9649152 | elapsed time per iteration (ms): 5688.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.503321E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:43.534813 | finish at 2025-09-10 11:54:52 + [2025-09-10 07:58:14] iteration 9424/ 11920 | consumed samples: 9650176 | elapsed time per iteration (ms): 5692.7 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.483929E+00 | loss scale: 1.0 | grad norm: 1.559 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:48.877213 
| finish at 2025-09-10 11:55:03 + [2025-09-10 07:58:20] iteration 9425/ 11920 | consumed samples: 9651200 | elapsed time per iteration (ms): 5699.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.660709E+00 | loss scale: 1.0 | grad norm: 2.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:00.057597 | finish at 2025-09-10 11:55:20 + [2025-09-10 07:58:26] iteration 9426/ 11920 | consumed samples: 9652224 | elapsed time per iteration (ms): 5960.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.567715E+00 | loss scale: 1.0 | grad norm: 1.503 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:44.532052 | finish at 2025-09-10 12:06:11 + [2025-09-10 07:58:32] iteration 9427/ 11920 | consumed samples: 9653248 | elapsed time per iteration (ms): 5697.4 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.542091E+00 | loss scale: 1.0 | grad norm: 1.591 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:43.639235 | finish at 2025-09-10 11:55:15 + [2025-09-10 07:58:38] iteration 9428/ 11920 | consumed samples: 9654272 | elapsed time per iteration (ms): 6078.8 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.581698E+00 | loss scale: 1.0 | grad norm: 2.001 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:12:28.378420 | finish at 2025-09-10 12:11:06 + [2025-09-10 07:58:44] iteration 9429/ 11920 | consumed samples: 9655296 | elapsed time per iteration (ms): 5717.5 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.616242E+00 | loss scale: 1.0 | grad norm: 2.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:22.378544 | finish at 2025-09-10 11:56:06 + [2025-09-10 07:58:49] iteration 9430/ 11920 | consumed samples: 9656320 | elapsed time per iteration (ms): 5687.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.532301E+00 | loss scale: 1.0 | grad norm: 0.610 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:02.802894 | finish at 2025-09-10 11:54:52 + [2025-09-10 07:58:55] iteration 9431/ 11920 | consumed samples: 9657344 | elapsed time per iteration (ms): 6281.0 | throughput per GPU (TFLOP/s/GPU): 71.9 | MFU 7.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.437936E+00 | loss scale: 1.0 | grad norm: 0.583 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:20:33.532774 | finish at 2025-09-10 12:19:29 + [2025-09-10 07:59:01] iteration 9432/ 11920 | consumed samples: 9658368 | elapsed time per iteration (ms): 5702.7 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.511835E+00 | loss scale: 1.0 | grad norm: 2.538 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:28.309639 | finish at 2025-09-10 11:55:30 + [2025-09-10 07:59:07] iteration 9433/ 11920 | consumed samples: 9659392 | elapsed time per iteration (ms): 5692.5 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.584072E+00 | loss scale: 1.0 | grad norm: 4.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:55:57.149948 | finish at 2025-09-10 11:55:04 + [2025-09-10 07:59:13] iteration 9434/ 11920 | consumed samples: 9660416 | 
elapsed time per iteration (ms): 5727.4 | throughput per GPU (TFLOP/s/GPU): 78.8 | MFU 7.97% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.794792E+00 | loss scale: 1.0 | grad norm: 6.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:18.436880 | finish at 2025-09-10 11:56:31 + [2025-09-10 07:59:18] iteration 9435/ 11920 | consumed samples: 9661440 | elapsed time per iteration (ms): 5710.2 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.574339E+00 | loss scale: 1.0 | grad norm: 1.013 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:29.805112 | finish at 2025-09-10 11:55:48 + [2025-09-10 07:59:24] iteration 9436/ 11920 | consumed samples: 9662464 | elapsed time per iteration (ms): 5692.8 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.488400E+00 | loss scale: 1.0 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:55:40.929525 | finish at 2025-09-10 11:55:05 + [2025-09-10 07:59:30] iteration 9437/ 11920 | consumed samples: 9663488 | elapsed time per iteration (ms): 6315.6 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.432194E+00 | loss scale: 1.0 | grad norm: 1.503 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:21:21.552310 | finish at 2025-09-10 12:20:52 + [2025-09-10 07:59:36] iteration 9438/ 11920 | consumed samples: 9664512 | elapsed time per iteration (ms): 5719.3 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.581863E+00 | loss scale: 1.0 | grad norm: 2.913 | num zeros: 1.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:56:35.293211 | finish at 2025-09-10 11:56:11 + [2025-09-10 07:59:42] iteration 9439/ 11920 | consumed samples: 9665536 | elapsed time per iteration (ms): 5863.5 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.457272E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:27.408860 | finish at 2025-09-10 12:02:09 + [2025-09-10 07:59:48] iteration 9440/ 11920 | consumed samples: 9666560 | elapsed time per iteration (ms): 5949.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.432909E+00 | loss scale: 1.0 | grad norm: 1.052 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:53.473587 | finish at 2025-09-10 12:05:41 + [2025-09-10 07:59:54] iteration 9441/ 11920 | consumed samples: 9667584 | elapsed time per iteration (ms): 5662.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.407739E+00 | loss scale: 1.0 | grad norm: 0.597 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:58.068380 | finish at 2025-09-10 11:53:52 + [2025-09-10 07:59:59] iteration 9442/ 11920 | consumed samples: 9668608 | elapsed time per iteration (ms): 5653.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.355947E+00 | loss scale: 1.0 | grad norm: 0.538 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:29.499039 | finish at 2025-09-10 11:53:29 + [2025-09-10 08:00:05] iteration 9443/ 11920 | consumed samples: 9669632 | elapsed time per iteration (ms): 5654.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 3.338896E+00 | loss scale: 1.0 | grad norm: 0.752 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:26.209511 | finish at 2025-09-10 11:53:31 + [2025-09-10 08:00:11] iteration 9444/ 11920 | consumed samples: 9670656 | elapsed time per iteration (ms): 5689.5 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.368884E+00 | loss scale: 1.0 | grad norm: 1.590 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:54:47.131983 | finish at 2025-09-10 11:54:58 + [2025-09-10 08:00:16] iteration 9445/ 11920 | consumed samples: 9671680 | elapsed time per iteration (ms): 5664.1 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.318309E+00 | loss scale: 1.0 | grad norm: 0.498 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:38.595994 | finish at 2025-09-10 11:53:55 + [2025-09-10 08:00:22] iteration 9446/ 11920 | consumed samples: 9672704 | elapsed time per iteration (ms): 5996.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.295538E+00 | loss scale: 1.0 | grad norm: 0.736 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:15.550434 | finish at 2025-09-10 12:07:38 + [2025-09-10 08:00:28] iteration 9447/ 11920 | consumed samples: 9673728 | elapsed time per iteration (ms): 5666.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.287606E+00 | loss scale: 1.0 | grad norm: 1.074 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:33.907424 | finish at 2025-09-10 11:54:02 + [2025-09-10 08:00:34] 
iteration 9448/ 11920 | consumed samples: 9674752 | elapsed time per iteration (ms): 5940.7 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.291055E+00 | loss scale: 1.0 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:45.492456 | finish at 2025-09-10 12:05:19 + [2025-09-10 08:00:40] iteration 9449/ 11920 | consumed samples: 9675776 | elapsed time per iteration (ms): 6011.4 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.352618E+00 | loss scale: 1.0 | grad norm: 2.322 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:07:34.271870 | finish at 2025-09-10 12:08:14 + [2025-09-10 08:00:46] iteration 9450/ 11920 | consumed samples: 9676800 | elapsed time per iteration (ms): 5916.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.265861E+00 | loss scale: 1.0 | grad norm: 0.578 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:34.412432 | finish at 2025-09-10 12:04:20 + [2025-09-10 08:00:52] iteration 9451/ 11920 | consumed samples: 9677824 | elapsed time per iteration (ms): 6048.1 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.249731E+00 | loss scale: 1.0 | grad norm: 0.888 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:08:52.727688 | finish at 2025-09-10 12:09:45 + [2025-09-10 08:00:57] iteration 9452/ 11920 | consumed samples: 9678848 | elapsed time per iteration (ms): 5674.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.220454E+00 | loss scale: 1.0 | grad norm: 0.426 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:23.860429 | finish at 2025-09-10 11:54:21 + [2025-09-10 08:01:03] iteration 9453/ 11920 | consumed samples: 9679872 | elapsed time per iteration (ms): 5962.3 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.209775E+00 | loss scale: 1.0 | grad norm: 0.632 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:08.871324 | finish at 2025-09-10 12:06:12 + [2025-09-10 08:01:09] iteration 9454/ 11920 | consumed samples: 9680896 | elapsed time per iteration (ms): 5668.5 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.181201E+00 | loss scale: 1.0 | grad norm: 0.693 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:52:58.404456 | finish at 2025-09-10 11:54:07 + [2025-09-10 08:01:15] iteration 9455/ 11920 | consumed samples: 9681920 | elapsed time per iteration (ms): 5672.1 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.191354E+00 | loss scale: 1.0 | grad norm: 0.703 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:01.813058 | finish at 2025-09-10 11:54:17 + [2025-09-10 08:01:20] iteration 9456/ 11920 | consumed samples: 9682944 | elapsed time per iteration (ms): 5673.0 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.176873E+00 | loss scale: 1.0 | grad norm: 0.629 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:52:58.200569 | finish at 2025-09-10 11:54:19 + [2025-09-10 08:01:26] iteration 9457/ 11920 | consumed samples: 9683968 | elapsed time per iteration (ms): 5886.0 | throughput 
per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.152058E+00 | loss scale: 1.0 | grad norm: 0.844 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:37.328196 | finish at 2025-09-10 12:03:04 + [2025-09-10 08:01:32] iteration 9458/ 11920 | consumed samples: 9684992 | elapsed time per iteration (ms): 5876.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.133526E+00 | loss scale: 1.0 | grad norm: 0.695 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:07.255958 | finish at 2025-09-10 12:02:39 + [2025-09-10 08:01:38] iteration 9459/ 11920 | consumed samples: 9686016 | elapsed time per iteration (ms): 5868.8 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.124534E+00 | loss scale: 1.0 | grad norm: 0.621 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:00:43.146539 | finish at 2025-09-10 12:02:21 + [2025-09-10 08:01:44] iteration 9460/ 11920 | consumed samples: 9687040 | elapsed time per iteration (ms): 5893.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.108405E+00 | loss scale: 1.0 | grad norm: 0.571 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:01:38.187933 | finish at 2025-09-10 12:03:22 + [2025-09-10 08:01:50] iteration 9461/ 11920 | consumed samples: 9688064 | elapsed time per iteration (ms): 5659.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.127938E+00 | loss scale: 1.0 | grad norm: 0.660 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:56.368418 
| finish at 2025-09-10 11:53:46 + [2025-09-10 08:01:55] iteration 9462/ 11920 | consumed samples: 9689088 | elapsed time per iteration (ms): 5652.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.087313E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:33.690076 | finish at 2025-09-10 11:53:29 + [2025-09-10 08:02:01] iteration 9463/ 11920 | consumed samples: 9690112 | elapsed time per iteration (ms): 5972.7 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.081447E+00 | loss scale: 1.0 | grad norm: 0.469 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:04:34.871471 | finish at 2025-09-10 12:06:36 + [2025-09-10 08:02:07] iteration 9464/ 11920 | consumed samples: 9691136 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.117481E+00 | loss scale: 1.0 | grad norm: 1.593 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:26.341219 | finish at 2025-09-10 11:53:33 + [2025-09-10 08:02:13] iteration 9465/ 11920 | consumed samples: 9692160 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.068111E+00 | loss scale: 1.0 | grad norm: 0.750 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:20.636834 | finish at 2025-09-10 11:53:33 + [2025-09-10 08:02:18] iteration 9466/ 11920 | consumed samples: 9693184 | elapsed time per iteration (ms): 5650.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
3.076791E+00 | loss scale: 1.0 | grad norm: 0.549 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:04.979708 | finish at 2025-09-10 11:53:23 + [2025-09-10 08:02:24] iteration 9467/ 11920 | consumed samples: 9694208 | elapsed time per iteration (ms): 5656.2 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.069839E+00 | loss scale: 1.0 | grad norm: 0.767 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:14.735048 | finish at 2025-09-10 11:53:39 + [2025-09-10 08:02:30] iteration 9468/ 11920 | consumed samples: 9695232 | elapsed time per iteration (ms): 6001.2 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063571E+00 | loss scale: 1.0 | grad norm: 0.813 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:05:15.001348 | finish at 2025-09-10 12:07:45 + [2025-09-10 08:02:36] iteration 9469/ 11920 | consumed samples: 9696256 | elapsed time per iteration (ms): 5659.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.053307E+00 | loss scale: 1.0 | grad norm: 0.593 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:10.938090 | finish at 2025-09-10 11:53:46 + [2025-09-10 08:02:41] iteration 9470/ 11920 | consumed samples: 9697280 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.038743E+00 | loss scale: 1.0 | grad norm: 0.517 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:50:31.306052 | finish at 2025-09-10 11:53:12 + [2025-09-10 08:02:47] iteration 9471/ 11920 | consumed samples: 9698304 | 
elapsed time per iteration (ms): 5645.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.047640E+00 | loss scale: 1.0 | grad norm: 0.587 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:50:24.503941 | finish at 2025-09-10 11:53:11 + [2025-09-10 08:02:53] iteration 9472/ 11920 | consumed samples: 9699328 | elapsed time per iteration (ms): 5840.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.032717E+00 | loss scale: 1.0 | grad norm: 0.711 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:58:16.794880 | finish at 2025-09-10 12:01:09 + [2025-09-10 08:02:58] iteration 9473/ 11920 | consumed samples: 9700352 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.014224E+00 | loss scale: 1.0 | grad norm: 0.555 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:56.201197 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:03:04] iteration 9474/ 11920 | consumed samples: 9701376 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.019733E+00 | loss scale: 1.0 | grad norm: 0.548 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:49.901290 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:03:10] iteration 9475/ 11920 | consumed samples: 9702400 | elapsed time per iteration (ms): 5643.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.010108E+00 | loss scale: 1.0 | grad norm: 0.654 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:49:57.788776 | finish at 2025-09-10 11:53:07 + [2025-09-10 08:03:15] iteration 9476/ 11920 | consumed samples: 9703424 | elapsed time per iteration (ms): 5639.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.001134E+00 | loss scale: 1.0 | grad norm: 0.682 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:42.324184 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:03:21] iteration 9477/ 11920 | consumed samples: 9704448 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.997581E+00 | loss scale: 1.0 | grad norm: 0.421 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:38.796341 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:03:26] iteration 9478/ 11920 | consumed samples: 9705472 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.986248E+00 | loss scale: 1.0 | grad norm: 0.466 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:30.787764 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:03:32] iteration 9479/ 11920 | consumed samples: 9706496 | elapsed time per iteration (ms): 5643.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974823E+00 | loss scale: 1.0 | grad norm: 0.577 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:36.306337 | finish at 2025-09-10 11:53:08 + [2025-09-10 08:03:38] iteration 9480/ 11920 | consumed samples: 9707520 | elapsed time per iteration (ms): 5961.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.985927E+00 | loss scale: 1.0 | grad norm: 0.880 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:02:27.083693 | finish at 2025-09-10 12:06:05 + [2025-09-10 08:03:44] iteration 9481/ 11920 | consumed samples: 9708544 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.976308E+00 | loss scale: 1.0 | grad norm: 0.399 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:00.467855 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:03:49] iteration 9482/ 11920 | consumed samples: 9709568 | elapsed time per iteration (ms): 5658.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970067E+00 | loss scale: 1.0 | grad norm: 0.719 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:54.441719 | finish at 2025-09-10 11:53:44 + [2025-09-10 08:03:55] iteration 9483/ 11920 | consumed samples: 9710592 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960183E+00 | loss scale: 1.0 | grad norm: 0.632 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:04.557078 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:04:01] iteration 9484/ 11920 | consumed samples: 9711616 | elapsed time per iteration (ms): 6000.7 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.984934E+00 | loss scale: 1.0 | grad norm: 0.832 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:03:37.635498 | finish at 2025-09-10 12:07:39 + [2025-09-10 08:04:07] 
iteration 9485/ 11920 | consumed samples: 9712640 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.965871E+00 | loss scale: 1.0 | grad norm: 0.388 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:43.077509 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:04:12] iteration 9486/ 11920 | consumed samples: 9713664 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.956899E+00 | loss scale: 1.0 | grad norm: 0.531 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:47.398722 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:04:18] iteration 9487/ 11920 | consumed samples: 9714688 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.974143E+00 | loss scale: 1.0 | grad norm: 0.679 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:18.540892 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:04:24] iteration 9488/ 11920 | consumed samples: 9715712 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953333E+00 | loss scale: 1.0 | grad norm: 0.535 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:26.344177 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:04:29] iteration 9489/ 11920 | consumed samples: 9716736 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.943486E+00 | loss scale: 1.0 | grad norm: 0.438 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:09.908161 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:04:35] iteration 9490/ 11920 | consumed samples: 9717760 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953582E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:11.209939 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:04:40] iteration 9491/ 11920 | consumed samples: 9718784 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937231E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:52.279130 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:04:46] iteration 9492/ 11920 | consumed samples: 9719808 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.947900E+00 | loss scale: 1.0 | grad norm: 0.429 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:41.828288 | finish at 2025-09-10 11:52:28 + [2025-09-10 08:04:52] iteration 9493/ 11920 | consumed samples: 9720832 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.948818E+00 | loss scale: 1.0 | grad norm: 0.875 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:58.886003 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:04:57] iteration 9494/ 11920 | consumed samples: 9721856 | elapsed time per iteration (ms): 5633.6 | throughput 
per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938661E+00 | loss scale: 1.0 | grad norm: 0.465 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:47.146561 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:05:03] iteration 9495/ 11920 | consumed samples: 9722880 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.949921E+00 | loss scale: 1.0 | grad norm: 0.902 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:22.831856 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:05:09] iteration 9496/ 11920 | consumed samples: 9723904 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.938689E+00 | loss scale: 1.0 | grad norm: 0.460 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:17.846289 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:05:14] iteration 9497/ 11920 | consumed samples: 9724928 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960636E+00 | loss scale: 1.0 | grad norm: 0.957 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:34.089657 | finish at 2025-09-10 11:52:48 + [2025-09-10 08:05:20] iteration 9498/ 11920 | consumed samples: 9725952 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933923E+00 | loss scale: 1.0 | grad norm: 0.520 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:46.015283 
| finish at 2025-09-10 11:53:06 + [2025-09-10 08:05:26] iteration 9499/ 11920 | consumed samples: 9726976 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946422E+00 | loss scale: 1.0 | grad norm: 0.532 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:16.969220 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:05:31] iteration 9500/ 11920 | consumed samples: 9728000 | elapsed time per iteration (ms): 5844.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918791E+00 | loss scale: 1.0 | grad norm: 0.320 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:55:43.385515 | finish at 2025-09-10 12:01:15 + [2025-09-10 08:05:37] iteration 9501/ 11920 | consumed samples: 9729024 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931550E+00 | loss scale: 1.0 | grad norm: 0.394 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:16.528960 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:05:43] iteration 9502/ 11920 | consumed samples: 9730048 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.924482E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:52.221298 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:05:48] iteration 9503/ 11920 | consumed samples: 9731072 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.917232E+00 | loss scale: 1.0 | grad norm: 0.492 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:47.651498 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:05:54] iteration 9504/ 11920 | consumed samples: 9732096 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912621E+00 | loss scale: 1.0 | grad norm: 0.537 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:37.443321 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:06:00] iteration 9505/ 11920 | consumed samples: 9733120 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931456E+00 | loss scale: 1.0 | grad norm: 0.458 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:15.084200 | finish at 2025-09-10 11:52:15 + [2025-09-10 08:06:05] iteration 9506/ 11920 | consumed samples: 9734144 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921635E+00 | loss scale: 1.0 | grad norm: 0.822 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:26.083562 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:06:11] iteration 9507/ 11920 | consumed samples: 9735168 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.937186E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:13.211296 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:06:16] iteration 9508/ 11920 | consumed samples: 9736192 | 
elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.922745E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:02.194445 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:06:22] iteration 9509/ 11920 | consumed samples: 9737216 | elapsed time per iteration (ms): 5866.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910007E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:55:44.957862 | finish at 2025-09-10 12:02:07 + [2025-09-10 08:06:28] iteration 9510/ 11920 | consumed samples: 9738240 | elapsed time per iteration (ms): 5848.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900932E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:54:54.302399 | finish at 2025-09-10 12:01:22 + [2025-09-10 08:06:34] iteration 9511/ 11920 | consumed samples: 9739264 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903449E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:59.482631 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:06:39] iteration 9512/ 11920 | consumed samples: 9740288 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903488E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:45:59.295961 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:06:45] iteration 9513/ 11920 | consumed samples: 9741312 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895628E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:29.754584 | finish at 2025-09-10 11:52:15 + [2025-09-10 08:06:51] iteration 9514/ 11920 | consumed samples: 9742336 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896708E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:37.593929 | finish at 2025-09-10 11:52:28 + [2025-09-10 08:06:56] iteration 9515/ 11920 | consumed samples: 9743360 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895100E+00 | loss scale: 1.0 | grad norm: 0.326 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:36.423197 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:07:02] iteration 9516/ 11920 | consumed samples: 9744384 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886749E+00 | loss scale: 1.0 | grad norm: 0.400 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:11.738380 | finish at 2025-09-10 11:52:14 + [2025-09-10 08:07:07] iteration 9517/ 11920 | consumed samples: 9745408 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.906491E+00 | loss scale: 1.0 | grad norm: 0.345 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:15.116136 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:07:13] iteration 9518/ 11920 | consumed samples: 9746432 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897842E+00 | loss scale: 1.0 | grad norm: 0.304 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:12.042018 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:07:19] iteration 9519/ 11920 | consumed samples: 9747456 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892510E+00 | loss scale: 1.0 | grad norm: 0.335 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:56.943329 | finish at 2025-09-10 11:52:16 + [2025-09-10 08:07:25] iteration 9520/ 11920 | consumed samples: 9748480 | elapsed time per iteration (ms): 5845.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.891681E+00 | loss scale: 1.0 | grad norm: 0.364 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:48.329659 | finish at 2025-09-10 12:01:13 + [2025-09-10 08:07:30] iteration 9521/ 11920 | consumed samples: 9749504 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903732E+00 | loss scale: 1.0 | grad norm: 0.462 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:00.000287 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:07:36] 
iteration 9522/ 11920 | consumed samples: 9750528 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900514E+00 | loss scale: 1.0 | grad norm: 0.511 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:05.960149 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:07:41] iteration 9523/ 11920 | consumed samples: 9751552 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892887E+00 | loss scale: 1.0 | grad norm: 0.429 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:38.138184 | finish at 2025-09-10 11:52:20 + [2025-09-10 08:07:47] iteration 9524/ 11920 | consumed samples: 9752576 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885285E+00 | loss scale: 1.0 | grad norm: 0.306 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:30.390782 | finish at 2025-09-10 11:52:17 + [2025-09-10 08:07:53] iteration 9525/ 11920 | consumed samples: 9753600 | elapsed time per iteration (ms): 5835.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889153E+00 | loss scale: 1.0 | grad norm: 0.287 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:52:55.944276 | finish at 2025-09-10 12:00:49 + [2025-09-10 08:07:59] iteration 9526/ 11920 | consumed samples: 9754624 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878915E+00 | loss scale: 1.0 | grad norm: 0.286 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:18.127885 | finish at 2025-09-10 11:52:17 + [2025-09-10 08:08:04] iteration 9527/ 11920 | consumed samples: 9755648 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.886319E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:15.249413 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:08:10] iteration 9528/ 11920 | consumed samples: 9756672 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871051E+00 | loss scale: 1.0 | grad norm: 0.278 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:03.302633 | finish at 2025-09-10 11:52:13 + [2025-09-10 08:08:15] iteration 9529/ 11920 | consumed samples: 9757696 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.884843E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:09.852709 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:08:21] iteration 9530/ 11920 | consumed samples: 9758720 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869960E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:14.426155 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:08:27] iteration 9531/ 11920 | consumed samples: 9759744 | elapsed time per iteration (ms): 5623.4 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874338E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:54.401643 | finish at 2025-09-10 11:52:21 + [2025-09-10 08:08:32] iteration 9532/ 11920 | consumed samples: 9760768 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868459E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:54.063987 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:08:38] iteration 9533/ 11920 | consumed samples: 9761792 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880136E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:55.482146 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:08:44] iteration 9534/ 11920 | consumed samples: 9762816 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.878128E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:34.820099 | finish at 2025-09-10 11:52:18 + [2025-09-10 08:08:49] iteration 9535/ 11920 | consumed samples: 9763840 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858746E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:34.343880 
| finish at 2025-09-10 11:52:24 + [2025-09-10 08:08:55] iteration 9536/ 11920 | consumed samples: 9764864 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871890E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:15.421368 | finish at 2025-09-10 11:52:10 +(min, max) time across ranks (ms): + save-checkpoint ................................: (3695.53, 3695.94) + [2025-09-10 08:09:04] iteration 9537/ 11920 | consumed samples: 9765888 | elapsed time per iteration (ms): 5606.2 | throughput per GPU (TFLOP/s/GPU): 80.5 | MFU 8.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.872221E+00 | loss scale: 1.0 | grad norm: 0.415 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:39.517737 | finish at 2025-09-10 11:51:44 + [2025-09-10 08:09:10] iteration 9538/ 11920 | consumed samples: 9766912 | elapsed time per iteration (ms): 5846.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.877483E+00 | loss scale: 1.0 | grad norm: 0.434 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:52:07.101664 | finish at 2025-09-10 12:01:17 + [2025-09-10 08:09:16] iteration 9539/ 11920 | consumed samples: 9767936 | elapsed time per iteration (ms): 5996.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870371E+00 | loss scale: 1.0 | grad norm: 0.285 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:57:58.563462 | finish at 2025-09-10 12:07:14 + [2025-09-10 08:09:22] iteration 9540/ 11920 | consumed samples: 9768960 | elapsed time per iteration (ms): 6211.2 | throughput per 
GPU (TFLOP/s/GPU): 72.7 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856107E+00 | loss scale: 1.0 | grad norm: 0.316 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 4:06:22.552724 | finish at 2025-09-10 12:15:45 + [2025-09-10 08:09:28] iteration 9541/ 11920 | consumed samples: 9769984 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863407E+00 | loss scale: 1.0 | grad norm: 0.343 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:01.871597 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:09:33] iteration 9542/ 11920 | consumed samples: 9771008 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871598E+00 | loss scale: 1.0 | grad norm: 0.284 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:03.838184 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:09:39] iteration 9543/ 11920 | consumed samples: 9772032 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858816E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:43.488850 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:09:45] iteration 9544/ 11920 | consumed samples: 9773056 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865034E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:55.866835 | 
finish at 2025-09-10 11:52:41 + [2025-09-10 08:09:50] iteration 9545/ 11920 | consumed samples: 9774080 | elapsed time per iteration (ms): 5641.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858058E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:19.409592 | finish at 2025-09-10 11:53:10 + [2025-09-10 08:09:56] iteration 9546/ 11920 | consumed samples: 9775104 | elapsed time per iteration (ms): 5964.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851099E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:00.571891 | finish at 2025-09-10 12:05:57 + [2025-09-10 08:10:02] iteration 9547/ 11920 | consumed samples: 9776128 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867927E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:25.612993 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:10:08] iteration 9548/ 11920 | consumed samples: 9777152 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862409E+00 | loss scale: 1.0 | grad norm: 0.270 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:29.713321 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:10:13] iteration 9549/ 11920 | consumed samples: 9778176 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.864902E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:15.324974 | finish at 2025-09-10 11:52:28 + [2025-09-10 08:10:19] iteration 9550/ 11920 | consumed samples: 9779200 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860517E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:12.542274 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:10:24] iteration 9551/ 11920 | consumed samples: 9780224 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860130E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:06.430989 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:10:30] iteration 9552/ 11920 | consumed samples: 9781248 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849424E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:12.575348 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:10:36] iteration 9553/ 11920 | consumed samples: 9782272 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862087E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:09.266155 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:10:41] iteration 9554/ 11920 | consumed samples: 9783296 | 
elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858856E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:18.400702 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:10:47] iteration 9555/ 11920 | consumed samples: 9784320 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856961E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:44.589344 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:10:53] iteration 9556/ 11920 | consumed samples: 9785344 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867490E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:44.847934 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:10:58] iteration 9557/ 11920 | consumed samples: 9786368 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854222E+00 | loss scale: 1.0 | grad norm: 0.372 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:25.824265 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:11:04] iteration 9558/ 11920 | consumed samples: 9787392 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867525E+00 | loss scale: 1.0 | grad norm: 0.404 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:41:20.173669 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:11:10] iteration 9559/ 11920 | consumed samples: 9788416 | elapsed time per iteration (ms): 5939.9 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863569E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:53:44.212918 | finish at 2025-09-10 12:04:54 + [2025-09-10 08:11:16] iteration 9560/ 11920 | consumed samples: 9789440 | elapsed time per iteration (ms): 5853.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850099E+00 | loss scale: 1.0 | grad norm: 0.274 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:50:14.713249 | finish at 2025-09-10 12:01:30 + [2025-09-10 08:11:21] iteration 9561/ 11920 | consumed samples: 9790464 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861544E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:06.192774 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:11:27] iteration 9562/ 11920 | consumed samples: 9791488 | elapsed time per iteration (ms): 5862.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869890E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:50:24.594562 | finish at 2025-09-10 12:01:52 + [2025-09-10 08:11:33] iteration 9563/ 11920 | consumed samples: 9792512 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.864592E+00 | loss scale: 1.0 | grad norm: 0.383 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:00.111502 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:11:38] iteration 9564/ 11920 | consumed samples: 9793536 | elapsed time per iteration (ms): 5636.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851444E+00 | loss scale: 1.0 | grad norm: 0.471 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:19.266242 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:11:44] iteration 9565/ 11920 | consumed samples: 9794560 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866406E+00 | loss scale: 1.0 | grad norm: 0.348 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:53.374082 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:11:50] iteration 9566/ 11920 | consumed samples: 9795584 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.868021E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:45.430657 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:11:55] iteration 9567/ 11920 | consumed samples: 9796608 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.851783E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:30.523276 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:12:01] 
iteration 9568/ 11920 | consumed samples: 9797632 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862293E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:15.709579 | finish at 2025-09-10 11:52:17 + [2025-09-10 08:12:07] iteration 9569/ 11920 | consumed samples: 9798656 | elapsed time per iteration (ms): 5874.1 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856660E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:50:10.075171 | finish at 2025-09-10 12:02:17 + [2025-09-10 08:12:12] iteration 9570/ 11920 | consumed samples: 9799680 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857611E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:11.323440 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:12:18] iteration 9571/ 11920 | consumed samples: 9800704 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860158E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:01.149553 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:12:24] iteration 9572/ 11920 | consumed samples: 9801728 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852944E+00 | loss scale: 1.0 | grad norm: 0.231 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:03.586950 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:12:29] iteration 9573/ 11920 | consumed samples: 9802752 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849787E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:02.137997 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:12:35] iteration 9574/ 11920 | consumed samples: 9803776 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857603E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:12.218871 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:12:40] iteration 9575/ 11920 | consumed samples: 9804800 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857362E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:48.322102 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:12:46] iteration 9576/ 11920 | consumed samples: 9805824 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845321E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:42.637728 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:12:52] iteration 9577/ 11920 | consumed samples: 9806848 | elapsed time per iteration (ms): 5638.5 | throughput 
per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838595E+00 | loss scale: 1.0 | grad norm: 0.282 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:10.956284 | finish at 2025-09-10 11:53:03 + [2025-09-10 08:12:58] iteration 9578/ 11920 | consumed samples: 9807872 | elapsed time per iteration (ms): 5850.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843675E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:21.804245 | finish at 2025-09-10 12:01:19 + [2025-09-10 08:13:03] iteration 9579/ 11920 | consumed samples: 9808896 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845971E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:20.423813 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:13:09] iteration 9580/ 11920 | consumed samples: 9809920 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843346E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:13.634977 | finish at 2025-09-10 11:52:22 + [2025-09-10 08:13:14] iteration 9581/ 11920 | consumed samples: 9810944 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841431E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:03.924995 
| finish at 2025-09-10 11:52:18 + [2025-09-10 08:13:20] iteration 9582/ 11920 | consumed samples: 9811968 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858022E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:04.457805 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:13:26] iteration 9583/ 11920 | consumed samples: 9812992 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848528E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:05.029370 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:13:31] iteration 9584/ 11920 | consumed samples: 9814016 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847336E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:58.849350 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:13:37] iteration 9585/ 11920 | consumed samples: 9815040 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850524E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:22.706952 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:13:43] iteration 9586/ 11920 | consumed samples: 9816064 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.842028E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:00.111983 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:13:48] iteration 9587/ 11920 | consumed samples: 9817088 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835625E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:34.368263 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:13:54] iteration 9588/ 11920 | consumed samples: 9818112 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834621E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:30.010230 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:14:00] iteration 9589/ 11920 | consumed samples: 9819136 | elapsed time per iteration (ms): 5961.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852470E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:51:35.463831 | finish at 2025-09-10 12:05:35 + [2025-09-10 08:14:05] iteration 9590/ 11920 | consumed samples: 9820160 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849316E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:32.057357 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:14:11] iteration 9591/ 11920 | consumed samples: 9821184 | 
elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834502E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:18.555483 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:14:17] iteration 9592/ 11920 | consumed samples: 9822208 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843895E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:03.949184 | finish at 2025-09-10 11:52:21 + [2025-09-10 08:14:22] iteration 9593/ 11920 | consumed samples: 9823232 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.848294E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:56.834855 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:14:28] iteration 9594/ 11920 | consumed samples: 9824256 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838616E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:02.639768 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:14:34] iteration 9595/ 11920 | consumed samples: 9825280 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839017E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:37:54.881655 | finish at 2025-09-10 11:52:28 + [2025-09-10 08:14:39] iteration 9596/ 11920 | consumed samples: 9826304 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841376E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:05.302129 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:14:45] iteration 9597/ 11920 | consumed samples: 9827328 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846081E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:50.880971 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:14:50] iteration 9598/ 11920 | consumed samples: 9828352 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857261E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:45.284695 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:14:56] iteration 9599/ 11920 | consumed samples: 9829376 | elapsed time per iteration (ms): 5957.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842809E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:50:27.558329 | finish at 2025-09-10 12:05:24 + [2025-09-10 08:15:02] iteration 9600/ 11920 | consumed samples: 9830400 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.848662E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:08.948936 | finish at 2025-09-10 11:52:11 + [2025-09-10 08:15:08] iteration 9601/ 11920 | consumed samples: 9831424 | elapsed time per iteration (ms): 5866.5 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825622E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:44.525507 | finish at 2025-09-10 12:01:52 + [2025-09-10 08:15:14] iteration 9602/ 11920 | consumed samples: 9832448 | elapsed time per iteration (ms): 5842.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839254E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:42.487179 | finish at 2025-09-10 12:00:56 + [2025-09-10 08:15:19] iteration 9603/ 11920 | consumed samples: 9833472 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826563E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:04.634368 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:15:25] iteration 9604/ 11920 | consumed samples: 9834496 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813517E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:21.874283 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:15:31] 
iteration 9605/ 11920 | consumed samples: 9835520 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836824E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:54.973003 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:15:36] iteration 9606/ 11920 | consumed samples: 9836544 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837060E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:37:00.522921 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:15:42] iteration 9607/ 11920 | consumed samples: 9837568 | elapsed time per iteration (ms): 5947.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835989E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:16.735674 | finish at 2025-09-10 12:04:59 + [2025-09-10 08:15:48] iteration 9608/ 11920 | consumed samples: 9838592 | elapsed time per iteration (ms): 5917.1 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839501E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:48:00.245792 | finish at 2025-09-10 12:03:48 + [2025-09-10 08:15:54] iteration 9609/ 11920 | consumed samples: 9839616 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822523E+00 | loss scale: 1.0 | grad norm: 0.127 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:36.690599 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:15:59] iteration 9610/ 11920 | consumed samples: 9840640 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835400E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:28.230414 | finish at 2025-09-10 11:52:28 + [2025-09-10 08:16:05] iteration 9611/ 11920 | consumed samples: 9841664 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826992E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:40.454739 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:16:11] iteration 9612/ 11920 | consumed samples: 9842688 | elapsed time per iteration (ms): 5921.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840539E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:47.789850 | finish at 2025-09-10 12:03:59 + [2025-09-10 08:16:16] iteration 9613/ 11920 | consumed samples: 9843712 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854109E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:02.459220 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:16:22] iteration 9614/ 11920 | consumed samples: 9844736 | elapsed time per iteration (ms): 5620.3 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832244E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:00.525183 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:16:28] iteration 9615/ 11920 | consumed samples: 9845760 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827727E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:02.158958 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:16:33] iteration 9616/ 11920 | consumed samples: 9846784 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850148E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:39.312744 | finish at 2025-09-10 11:52:13 + [2025-09-10 08:16:39] iteration 9617/ 11920 | consumed samples: 9847808 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831432E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:50.152041 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:16:45] iteration 9618/ 11920 | consumed samples: 9848832 | elapsed time per iteration (ms): 5935.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831398E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:47:43.024634 
| finish at 2025-09-10 12:04:28 + [2025-09-10 08:16:50] iteration 9619/ 11920 | consumed samples: 9849856 | elapsed time per iteration (ms): 5615.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838656E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:20.652651 | finish at 2025-09-10 11:52:11 + [2025-09-10 08:16:56] iteration 9620/ 11920 | consumed samples: 9850880 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824913E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:37.889338 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:17:02] iteration 9621/ 11920 | consumed samples: 9851904 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835140E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:16.869001 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:17:08] iteration 9622/ 11920 | consumed samples: 9852928 | elapsed time per iteration (ms): 5882.9 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828726E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:18.911870 | finish at 2025-09-10 12:02:27 + [2025-09-10 08:17:13] iteration 9623/ 11920 | consumed samples: 9853952 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.820361E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:15.598691 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:17:19] iteration 9624/ 11920 | consumed samples: 9854976 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833502E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:00.540190 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:17:25] iteration 9625/ 11920 | consumed samples: 9856000 | elapsed time per iteration (ms): 5893.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820030E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:25.785159 | finish at 2025-09-10 12:02:51 + [2025-09-10 08:17:31] iteration 9626/ 11920 | consumed samples: 9857024 | elapsed time per iteration (ms): 5887.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841228E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:05.739152 | finish at 2025-09-10 12:02:36 + [2025-09-10 08:17:37] iteration 9627/ 11920 | consumed samples: 9858048 | elapsed time per iteration (ms): 5998.7 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836038E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:49:14.950542 | finish at 2025-09-10 12:06:52 + [2025-09-10 08:17:42] iteration 9628/ 11920 | consumed samples: 9859072 | 
elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819818E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:41.370343 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:17:48] iteration 9629/ 11920 | consumed samples: 9860096 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834356E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:48.656214 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:17:54] iteration 9630/ 11920 | consumed samples: 9861120 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846283E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:57.052257 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:17:59] iteration 9631/ 11920 | consumed samples: 9862144 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827568E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:18.978835 | finish at 2025-09-10 11:52:18 + [2025-09-10 08:18:05] iteration 9632/ 11920 | consumed samples: 9863168 | elapsed time per iteration (ms): 5935.9 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815528E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 6.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:46:21.420963 | finish at 2025-09-10 12:04:26 + [2025-09-10 08:18:11] iteration 9633/ 11920 | consumed samples: 9864192 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825665E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:12.074405 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:18:16] iteration 9634/ 11920 | consumed samples: 9865216 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831098E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:10.099910 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:18:22] iteration 9635/ 11920 | consumed samples: 9866240 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820235E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:05.306225 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:18:28] iteration 9636/ 11920 | consumed samples: 9867264 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845540E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:07.352427 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:18:33] iteration 9637/ 11920 | consumed samples: 9868288 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.831275E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:04.020124 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:18:39] iteration 9638/ 11920 | consumed samples: 9869312 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821465E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:53.233669 | finish at 2025-09-10 11:52:32 + [2025-09-10 08:18:44] iteration 9639/ 11920 | consumed samples: 9870336 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843053E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:33.493180 | finish at 2025-09-10 11:52:18 + [2025-09-10 08:18:50] iteration 9640/ 11920 | consumed samples: 9871360 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838767E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:23.211651 | finish at 2025-09-10 11:52:13 + [2025-09-10 08:18:56] iteration 9641/ 11920 | consumed samples: 9872384 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826519E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:44.184787 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:19:01] 
iteration 9642/ 11920 | consumed samples: 9873408 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824811E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:21.812820 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:19:07] iteration 9643/ 11920 | consumed samples: 9874432 | elapsed time per iteration (ms): 5958.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838453E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:46:06.374560 | finish at 2025-09-10 12:05:14 + [2025-09-10 08:19:13] iteration 9644/ 11920 | consumed samples: 9875456 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847878E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:05.855040 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:19:19] iteration 9645/ 11920 | consumed samples: 9876480 | elapsed time per iteration (ms): 6191.3 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842812E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:54:45.160661 | finish at 2025-09-10 12:14:04 + [2025-09-10 08:19:25] iteration 9646/ 11920 | consumed samples: 9877504 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823028E+00 | loss scale: 1.0 | grad norm: 0.160 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:59.708958 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:19:30] iteration 9647/ 11920 | consumed samples: 9878528 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830086E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:11.400299 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:19:36] iteration 9648/ 11920 | consumed samples: 9879552 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826790E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:02.407257 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:19:42] iteration 9649/ 11920 | consumed samples: 9880576 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815142E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:40.979562 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:19:47] iteration 9650/ 11920 | consumed samples: 9881600 | elapsed time per iteration (ms): 5909.7 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824798E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:43:35.084839 | finish at 2025-09-10 12:03:23 + [2025-09-10 08:19:53] iteration 9651/ 11920 | consumed samples: 9882624 | elapsed time per iteration (ms): 5622.5 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842315E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:37.528106 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:19:59] iteration 9652/ 11920 | consumed samples: 9883648 | elapsed time per iteration (ms): 5819.7 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827128E+00 | loss scale: 1.0 | grad norm: 0.115 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:39:59.047162 | finish at 2025-09-10 11:59:58 + [2025-09-10 08:20:05] iteration 9653/ 11920 | consumed samples: 9884672 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831985E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:20.380835 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:20:10] iteration 9654/ 11920 | consumed samples: 9885696 | elapsed time per iteration (ms): 5977.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823519E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:45:43.997611 | finish at 2025-09-10 12:05:54 + [2025-09-10 08:20:16] iteration 9655/ 11920 | consumed samples: 9886720 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816534E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:13.043686 
| finish at 2025-09-10 11:52:29 + [2025-09-10 08:20:22] iteration 9656/ 11920 | consumed samples: 9887744 | elapsed time per iteration (ms): 5615.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834733E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:52.952160 | finish at 2025-09-10 11:52:15 + [2025-09-10 08:20:27] iteration 9657/ 11920 | consumed samples: 9888768 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822941E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:05.303082 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:20:33] iteration 9658/ 11920 | consumed samples: 9889792 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829102E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:41.244354 | finish at 2025-09-10 11:52:14 + [2025-09-10 08:20:39] iteration 9659/ 11920 | consumed samples: 9890816 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819512E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:04.572749 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:20:44] iteration 9660/ 11920 | consumed samples: 9891840 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.814182E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:40.003009 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:20:50] iteration 9661/ 11920 | consumed samples: 9892864 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820778E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:28.605032 | finish at 2025-09-10 11:52:18 + [2025-09-10 08:20:55] iteration 9662/ 11920 | consumed samples: 9893888 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814478E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:48.326061 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:21:01] iteration 9663/ 11920 | consumed samples: 9894912 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830870E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:26.431378 | finish at 2025-09-10 11:52:28 + [2025-09-10 08:21:07] iteration 9664/ 11920 | consumed samples: 9895936 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829129E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:26.373665 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:21:12] iteration 9665/ 11920 | consumed samples: 9896960 | 
elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828625E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:08.826090 | finish at 2025-09-10 11:52:21 + [2025-09-10 08:21:18] iteration 9666/ 11920 | consumed samples: 9897984 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819987E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:01.742508 | finish at 2025-09-10 11:52:20 + [2025-09-10 08:21:24] iteration 9667/ 11920 | consumed samples: 9899008 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826804E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:16.304433 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:21:29] iteration 9668/ 11920 | consumed samples: 9900032 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829625E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:19.121067 | finish at 2025-09-10 11:52:48 + [2025-09-10 08:21:35] iteration 9669/ 11920 | consumed samples: 9901056 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826594E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 2.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:30:55.713374 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:21:40] iteration 9670/ 11920 | consumed samples: 9902080 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824993E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:53.450847 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:21:46] iteration 9671/ 11920 | consumed samples: 9903104 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832823E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:30.421393 | finish at 2025-09-10 11:52:16 + [2025-09-10 08:21:52] iteration 9672/ 11920 | consumed samples: 9904128 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826175E+00 | loss scale: 1.0 | grad norm: 0.113 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:34.621038 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:21:58] iteration 9673/ 11920 | consumed samples: 9905152 | elapsed time per iteration (ms): 5828.3 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841068E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:38:16.115097 | finish at 2025-09-10 12:00:14 + [2025-09-10 08:22:03] iteration 9674/ 11920 | consumed samples: 9906176 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.815754E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:22.820687 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:22:09] iteration 9675/ 11920 | consumed samples: 9907200 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825061E+00 | loss scale: 1.0 | grad norm: 0.112 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:07.933240 | finish at 2025-09-10 11:52:17 + [2025-09-10 08:22:14] iteration 9676/ 11920 | consumed samples: 9908224 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814373E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:03.856997 | finish at 2025-09-10 11:52:18 + [2025-09-10 08:22:20] iteration 9677/ 11920 | consumed samples: 9909248 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827799E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:02.676780 | finish at 2025-09-10 11:52:23 + [2025-09-10 08:22:26] iteration 9678/ 11920 | consumed samples: 9910272 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818677E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:15.523603 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:22:32] 
iteration 9679/ 11920 | consumed samples: 9911296 | elapsed time per iteration (ms): 5931.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830795E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:32.224258 | finish at 2025-09-10 12:04:04 + [2025-09-10 08:22:37] iteration 9680/ 11920 | consumed samples: 9912320 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811803E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:09.337463 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:22:43] iteration 9681/ 11920 | consumed samples: 9913344 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830853E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:46.458995 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:22:48] iteration 9682/ 11920 | consumed samples: 9914368 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843970E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:37.922045 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:22:54] iteration 9683/ 11920 | consumed samples: 9915392 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837300E+00 | loss scale: 1.0 | grad norm: 0.221 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:57.757248 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:23:00] iteration 9684/ 11920 | consumed samples: 9916416 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832178E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:52.209404 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:23:05] iteration 9685/ 11920 | consumed samples: 9917440 | elapsed time per iteration (ms): 5614.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816085E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:08.590908 | finish at 2025-09-10 11:52:14 + [2025-09-10 08:23:11] iteration 9686/ 11920 | consumed samples: 9918464 | elapsed time per iteration (ms): 5946.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821330E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:23.282730 | finish at 2025-09-10 12:04:35 + [2025-09-10 08:23:17] iteration 9687/ 11920 | consumed samples: 9919488 | elapsed time per iteration (ms): 5986.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825898E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:48.431665 | finish at 2025-09-10 12:06:06 + [2025-09-10 08:23:23] iteration 9688/ 11920 | consumed samples: 9920512 | elapsed time per iteration (ms): 5620.0 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829308E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:03.851452 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:23:28] iteration 9689/ 11920 | consumed samples: 9921536 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832982E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:09.726062 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:23:35] iteration 9690/ 11920 | consumed samples: 9922560 | elapsed time per iteration (ms): 6367.3 | throughput per GPU (TFLOP/s/GPU): 70.9 | MFU 7.17% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.834271E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:56:39.153509 | finish at 2025-09-10 12:20:14 + [2025-09-10 08:23:40] iteration 9691/ 11920 | consumed samples: 9923584 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824715E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:20.255017 | finish at 2025-09-10 11:53:01 + [2025-09-10 08:23:46] iteration 9692/ 11920 | consumed samples: 9924608 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823545E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:28:48.240866 
| finish at 2025-09-10 11:52:34 + [2025-09-10 08:23:52] iteration 9693/ 11920 | consumed samples: 9925632 | elapsed time per iteration (ms): 5956.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825633E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:04.601563 | finish at 2025-09-10 12:04:57 + [2025-09-10 08:23:58] iteration 9694/ 11920 | consumed samples: 9926656 | elapsed time per iteration (ms): 5999.8 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821341E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:35.655032 | finish at 2025-09-10 12:06:34 + [2025-09-10 08:24:04] iteration 9695/ 11920 | consumed samples: 9927680 | elapsed time per iteration (ms): 5994.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828141E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:42:18.603139 | finish at 2025-09-10 12:06:23 + [2025-09-10 08:24:10] iteration 9696/ 11920 | consumed samples: 9928704 | elapsed time per iteration (ms): 5614.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826566E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:28:06.123692 | finish at 2025-09-10 11:52:16 + [2025-09-10 08:24:15] iteration 9697/ 11920 | consumed samples: 9929728 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.822869E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:28:09.543884 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:24:21] iteration 9698/ 11920 | consumed samples: 9930752 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832683E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:28:03.864103 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:24:27] iteration 9699/ 11920 | consumed samples: 9931776 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833370E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:52.936758 | finish at 2025-09-10 11:52:19 + [2025-09-10 08:24:32] iteration 9700/ 11920 | consumed samples: 9932800 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828240E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:28:03.842611 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:24:38] iteration 9701/ 11920 | consumed samples: 9933824 | elapsed time per iteration (ms): 5959.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835492E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:40:24.897260 | finish at 2025-09-10 12:05:03 + [2025-09-10 08:24:44] iteration 9702/ 11920 | consumed samples: 9934848 | 
elapsed time per iteration (ms): 5838.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823977E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:50.655544 | finish at 2025-09-10 12:00:35 + [2025-09-10 08:24:50] iteration 9703/ 11920 | consumed samples: 9935872 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830919E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:39.758576 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:24:55] iteration 9704/ 11920 | consumed samples: 9936896 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838689E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:50.977060 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:25:01] iteration 9705/ 11920 | consumed samples: 9937920 | elapsed time per iteration (ms): 5614.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816985E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:16.213187 | finish at 2025-09-10 11:52:17 + [2025-09-10 08:25:06] iteration 9706/ 11920 | consumed samples: 9938944 | elapsed time per iteration (ms): 5614.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829031E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 1.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:27:09.555067 | finish at 2025-09-10 11:52:16 + [2025-09-10 08:25:12] iteration 9707/ 11920 | consumed samples: 9939968 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818378E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:12.257347 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:25:18] iteration 9708/ 11920 | consumed samples: 9940992 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836476E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:11.772525 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:25:23] iteration 9709/ 11920 | consumed samples: 9942016 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826289E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:01.866171 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:25:29] iteration 9710/ 11920 | consumed samples: 9943040 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816556E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:14.837170 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:25:35] iteration 9711/ 11920 | consumed samples: 9944064 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.833979E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:52.349314 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:25:40] iteration 9712/ 11920 | consumed samples: 9945088 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819711E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:56.831955 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:25:46] iteration 9713/ 11920 | consumed samples: 9946112 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819170E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:43.317121 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:25:51] iteration 9714/ 11920 | consumed samples: 9947136 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821954E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:45.698956 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:25:57] iteration 9715/ 11920 | consumed samples: 9948160 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826072E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:36.551485 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:26:03] 
iteration 9716/ 11920 | consumed samples: 9949184 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831419E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:19.149903 | finish at 2025-09-10 11:52:22 + [2025-09-10 08:26:08] iteration 9717/ 11920 | consumed samples: 9950208 | elapsed time per iteration (ms): 5836.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818245E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:17.984773 | finish at 2025-09-10 12:00:26 + [2025-09-10 08:26:14] iteration 9718/ 11920 | consumed samples: 9951232 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836040E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:14.890625 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:26:20] iteration 9719/ 11920 | consumed samples: 9952256 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820683E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:39.254480 | finish at 2025-09-10 11:52:59 + [2025-09-10 08:26:25] iteration 9720/ 11920 | consumed samples: 9953280 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825221E+00 | loss scale: 1.0 | grad norm: 0.200 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:15.998163 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:26:31] iteration 9721/ 11920 | consumed samples: 9954304 | elapsed time per iteration (ms): 5864.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822747E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:56.223294 | finish at 2025-09-10 12:01:27 + [2025-09-10 08:26:37] iteration 9722/ 11920 | consumed samples: 9955328 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840634E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:08.802833 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:26:42] iteration 9723/ 11920 | consumed samples: 9956352 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815168E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:42.487831 | finish at 2025-09-10 11:52:25 + [2025-09-10 08:26:48] iteration 9724/ 11920 | consumed samples: 9957376 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813093E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:52.801575 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:26:54] iteration 9725/ 11920 | consumed samples: 9958400 | elapsed time per iteration (ms): 5622.2 | throughput 
per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819662E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:40.707570 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:26:59] iteration 9726/ 11920 | consumed samples: 9959424 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827202E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:01.184974 | finish at 2025-09-10 11:53:01 + [2025-09-10 08:27:05] iteration 9727/ 11920 | consumed samples: 9960448 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826113E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:42.217640 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:27:11] iteration 9728/ 11920 | consumed samples: 9961472 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824135E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:16.308571 | finish at 2025-09-10 11:52:27 + [2025-09-10 08:27:16] iteration 9729/ 11920 | consumed samples: 9962496 | elapsed time per iteration (ms): 5837.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824549E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:10.682019 
| finish at 2025-09-10 12:00:27 + [2025-09-10 08:27:22] iteration 9730/ 11920 | consumed samples: 9963520 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815975E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:20.662587 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:27:28] iteration 9731/ 11920 | consumed samples: 9964544 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815323E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:11.570787 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:27:33] iteration 9732/ 11920 | consumed samples: 9965568 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821743E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:00.081475 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:27:39] iteration 9733/ 11920 | consumed samples: 9966592 | elapsed time per iteration (ms): 5928.0 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844125E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:04.615719 | finish at 2025-09-10 12:03:44 + [2025-09-10 08:27:45] iteration 9734/ 11920 | consumed samples: 9967616 | elapsed time per iteration (ms): 5952.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 
2.821299E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:36:51.129639 | finish at 2025-09-10 12:04:36 + [2025-09-10 08:27:51] iteration 9735/ 11920 | consumed samples: 9968640 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831445E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:47.637383 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:27:56] iteration 9736/ 11920 | consumed samples: 9969664 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820364E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:55.465153 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:28:02] iteration 9737/ 11920 | consumed samples: 9970688 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813454E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:42.242777 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:28:08] iteration 9738/ 11920 | consumed samples: 9971712 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801336E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:17.954277 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:28:13] iteration 9739/ 11920 | consumed samples: 9972736 | 
elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820309E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:23.033769 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:28:19] iteration 9740/ 11920 | consumed samples: 9973760 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832136E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:17.448525 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:28:25] iteration 9741/ 11920 | consumed samples: 9974784 | elapsed time per iteration (ms): 5875.4 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824614E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:22.572074 | finish at 2025-09-10 12:01:47 + [2025-09-10 08:28:30] iteration 9742/ 11920 | consumed samples: 9975808 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821836E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:01.081544 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:28:36] iteration 9743/ 11920 | consumed samples: 9976832 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813545E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | remaining time: 3:24:07.577099 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:28:42] iteration 9744/ 11920 | consumed samples: 9977856 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806765E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:56.373596 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:28:47] iteration 9745/ 11920 | consumed samples: 9978880 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822346E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:16.668949 | finish at 2025-09-10 11:53:04 + [2025-09-10 08:28:53] iteration 9746/ 11920 | consumed samples: 9979904 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815357E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:04.750600 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:28:59] iteration 9747/ 11920 | consumed samples: 9980928 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835746E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:25.411723 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:29:04] iteration 9748/ 11920 | consumed samples: 9981952 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.829009E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:25.093466 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:29:10] iteration 9749/ 11920 | consumed samples: 9982976 | elapsed time per iteration (ms): 5909.8 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815853E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:50.130883 | finish at 2025-09-10 12:03:00 + [2025-09-10 08:29:16] iteration 9750/ 11920 | consumed samples: 9984000 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827193E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:16.569521 | finish at 2025-09-10 11:52:32 + [2025-09-10 08:29:21] iteration 9751/ 11920 | consumed samples: 9985024 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812582E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:09.822674 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:29:27] iteration 9752/ 11920 | consumed samples: 9986048 | elapsed time per iteration (ms): 5956.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832366E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:35:12.558056 | finish at 2025-09-10 12:04:40 + [2025-09-10 08:29:33] 
iteration 9753/ 11920 | consumed samples: 9987072 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822710E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:05.604983 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:29:39] iteration 9754/ 11920 | consumed samples: 9988096 | elapsed time per iteration (ms): 5935.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819203E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:16.855690 | finish at 2025-09-10 12:03:56 + [2025-09-10 08:29:45] iteration 9755/ 11920 | consumed samples: 9989120 | elapsed time per iteration (ms): 5942.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816123E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:26.182228 | finish at 2025-09-10 12:04:11 + [2025-09-10 08:29:50] iteration 9756/ 11920 | consumed samples: 9990144 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818139E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:38.148158 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:29:56] iteration 9757/ 11920 | consumed samples: 9991168 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820176E+00 | loss scale: 1.0 | grad norm: 0.124 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:38.271587 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:30:02] iteration 9758/ 11920 | consumed samples: 9992192 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817979E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:40.132481 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:30:07] iteration 9759/ 11920 | consumed samples: 9993216 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813936E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:22.301345 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:30:13] iteration 9760/ 11920 | consumed samples: 9994240 | elapsed time per iteration (ms): 6235.1 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800833E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:44:27.843876 | finish at 2025-09-10 12:14:41 + [2025-09-10 08:30:19] iteration 9761/ 11920 | consumed samples: 9995264 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825414E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:12.451945 | finish at 2025-09-10 11:52:32 + [2025-09-10 08:30:25] iteration 9762/ 11920 | consumed samples: 9996288 | elapsed time per iteration (ms): 5966.7 | throughput 
per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822696E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:34:36.206681 | finish at 2025-09-10 12:05:01 + [2025-09-10 08:30:31] iteration 9763/ 11920 | consumed samples: 9997312 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817718E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:36.241900 | finish at 2025-09-10 11:53:07 + [2025-09-10 08:30:37] iteration 9764/ 11920 | consumed samples: 9998336 | elapsed time per iteration (ms): 5837.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814850E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:44.831869 | finish at 2025-09-10 12:00:21 + [2025-09-10 08:30:42] iteration 9765/ 11920 | consumed samples: 9999360 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819684E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:56.847545 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:30:48] iteration 9766/ 11920 | consumed samples: 10000384 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823578E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:49.768440 
| finish at 2025-09-10 11:52:38 + [2025-09-10 08:30:54] iteration 9767/ 11920 | consumed samples: 10001408 | elapsed time per iteration (ms): 5885.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816694E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:31:10.684223 | finish at 2025-09-10 12:02:04 + [2025-09-10 08:30:59] iteration 9768/ 11920 | consumed samples: 10002432 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817060E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:33.526577 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:31:05] iteration 9769/ 11920 | consumed samples: 10003456 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828207E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:18.803515 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:31:11] iteration 9770/ 11920 | consumed samples: 10004480 | elapsed time per iteration (ms): 5964.2 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837705E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:42.997749 | finish at 2025-09-10 12:04:54 + [2025-09-10 08:31:17] iteration 9771/ 11920 | consumed samples: 10005504 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 2.823625E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:13.496054 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:31:22] iteration 9772/ 11920 | consumed samples: 10006528 | elapsed time per iteration (ms): 5933.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820239E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:32:24.521029 | finish at 2025-09-10 12:03:47 + [2025-09-10 08:31:28] iteration 9773/ 11920 | consumed samples: 10007552 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815832E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:10.696040 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:31:34] iteration 9774/ 11920 | consumed samples: 10008576 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820527E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:13.614317 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:31:39] iteration 9775/ 11920 | consumed samples: 10009600 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820730E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:20.857794 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:31:45] iteration 9776/ 11920 | consumed samples: 
10010624 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821535E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:47.897552 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:31:51] iteration 9777/ 11920 | consumed samples: 10011648 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810449E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:38.831457 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:31:56] iteration 9778/ 11920 | consumed samples: 10012672 | elapsed time per iteration (ms): 5858.9 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837587E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:09.757311 | finish at 2025-09-10 12:01:06 + [2025-09-10 08:32:02] iteration 9779/ 11920 | consumed samples: 10013696 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817927E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:24.250957 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:32:08] iteration 9780/ 11920 | consumed samples: 10014720 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818219E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 6.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:21.644535 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:32:13] iteration 9781/ 11920 | consumed samples: 10015744 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814814E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:23.180906 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:32:19] iteration 9782/ 11920 | consumed samples: 10016768 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818911E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:16.550178 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:32:25] iteration 9783/ 11920 | consumed samples: 10017792 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825331E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:04.739284 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:32:30] iteration 9784/ 11920 | consumed samples: 10018816 | elapsed time per iteration (ms): 5642.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824284E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:53.272573 | finish at 2025-09-10 11:53:23 + [2025-09-10 08:32:36] iteration 9785/ 11920 | consumed samples: 10019840 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 
| MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835724E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:21.045895 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:32:41] iteration 9786/ 11920 | consumed samples: 10020864 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827850E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:53.806301 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:32:47] iteration 9787/ 11920 | consumed samples: 10021888 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811515E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:51.587122 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:32:53] iteration 9788/ 11920 | consumed samples: 10022912 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822876E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:44.653243 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:32:58] iteration 9789/ 11920 | consumed samples: 10023936 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818349E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:40.402188 | finish at 2025-09-10 
11:52:39 + [2025-09-10 08:33:04] iteration 9790/ 11920 | consumed samples: 10024960 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826585E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:30.585537 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:33:10] iteration 9791/ 11920 | consumed samples: 10025984 | elapsed time per iteration (ms): 5868.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807645E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:28:13.773047 | finish at 2025-09-10 12:01:24 + [2025-09-10 08:33:15] iteration 9792/ 11920 | consumed samples: 10027008 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823257E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:15.023903 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:33:21] iteration 9793/ 11920 | consumed samples: 10028032 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811403E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:24.195481 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:33:27] iteration 9794/ 11920 | consumed samples: 10029056 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817609E+00 | loss 
scale: 1.0 | grad norm: 0.237 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:12.443932 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:33:32] iteration 9795/ 11920 | consumed samples: 10030080 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838330E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:05.469677 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:33:38] iteration 9796/ 11920 | consumed samples: 10031104 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820371E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:10.662981 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:33:44] iteration 9797/ 11920 | consumed samples: 10032128 | elapsed time per iteration (ms): 5817.1 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822357E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:49.708970 | finish at 2025-09-10 11:59:33 + [2025-09-10 08:33:49] iteration 9798/ 11920 | consumed samples: 10033152 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830574E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:41.706196 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:33:55] iteration 9799/ 11920 | consumed samples: 10034176 | elapsed time 
per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828175E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:15.520924 | finish at 2025-09-10 11:53:10 + [2025-09-10 08:34:01] iteration 9800/ 11920 | consumed samples: 10035200 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825378E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:23.708534 | finish at 2025-09-10 11:52:24 + [2025-09-10 08:34:07] iteration 9801/ 11920 | consumed samples: 10036224 | elapsed time per iteration (ms): 5962.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822512E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:34.547301 | finish at 2025-09-10 12:04:41 + [2025-09-10 08:34:12] iteration 9802/ 11920 | consumed samples: 10037248 | elapsed time per iteration (ms): 5615.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814851E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:14.143003 | finish at 2025-09-10 11:52:26 + [2025-09-10 08:34:18] iteration 9803/ 11920 | consumed samples: 10038272 | elapsed time per iteration (ms): 5965.3 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828215E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 3:30:28.461033 | finish at 2025-09-10 12:04:47 + [2025-09-10 08:34:24] iteration 9804/ 11920 | consumed samples: 10039296 | elapsed time per iteration (ms): 5837.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820853E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:52.228398 | finish at 2025-09-10 12:00:16 + [2025-09-10 08:34:30] iteration 9805/ 11920 | consumed samples: 10040320 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808095E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:24.284527 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:34:36] iteration 9806/ 11920 | consumed samples: 10041344 | elapsed time per iteration (ms): 6067.8 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832038E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:33:47.300062 | finish at 2025-09-10 12:08:23 + [2025-09-10 08:34:41] iteration 9807/ 11920 | consumed samples: 10042368 | elapsed time per iteration (ms): 5828.2 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823728E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:14.984296 | finish at 2025-09-10 11:59:56 + [2025-09-10 08:34:48] iteration 9808/ 11920 | consumed samples: 10043392 | elapsed time per iteration (ms): 6303.1 | throughput per GPU (TFLOP/s/GPU): 71.6 | MFU 7.24% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.816182E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:41:52.223465 | finish at 2025-09-10 12:16:40 + [2025-09-10 08:34:53] iteration 9809/ 11920 | consumed samples: 10044416 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819853E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:45.897891 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:34:59] iteration 9810/ 11920 | consumed samples: 10045440 | elapsed time per iteration (ms): 5869.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829864E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:26:24.293194 | finish at 2025-09-10 12:01:24 + [2025-09-10 08:35:05] iteration 9811/ 11920 | consumed samples: 10046464 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819037E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:25.159070 | finish at 2025-09-10 11:52:30 + [2025-09-10 08:35:11] iteration 9812/ 11920 | consumed samples: 10047488 | elapsed time per iteration (ms): 5841.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825933E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:13.466956 | finish at 2025-09-10 12:00:24 + [2025-09-10 
08:35:16] iteration 9813/ 11920 | consumed samples: 10048512 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818480E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:23.786696 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:35:22] iteration 9814/ 11920 | consumed samples: 10049536 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827506E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:15.954244 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:35:28] iteration 9815/ 11920 | consumed samples: 10050560 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813953E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:10.760723 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:35:34] iteration 9816/ 11920 | consumed samples: 10051584 | elapsed time per iteration (ms): 5981.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808447E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:45.060261 | finish at 2025-09-10 12:05:19 + [2025-09-10 08:35:39] iteration 9817/ 11920 | consumed samples: 10052608 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827742E+00 | loss scale: 1.0 | grad norm: 
0.124 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:16.618643 | finish at 2025-09-10 11:52:56 + [2025-09-10 08:35:45] iteration 9818/ 11920 | consumed samples: 10053632 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819480E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:58.768511 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:35:50] iteration 9819/ 11920 | consumed samples: 10054656 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821623E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:47.731464 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:35:56] iteration 9820/ 11920 | consumed samples: 10055680 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824476E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:57.354584 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:36:02] iteration 9821/ 11920 | consumed samples: 10056704 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818953E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:27.672594 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:36:07] iteration 9822/ 11920 | consumed samples: 10057728 | elapsed time per iteration (ms): 
5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824177E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:42.773116 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:36:13] iteration 9823/ 11920 | consumed samples: 10058752 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817420E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:34.072613 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:36:19] iteration 9824/ 11920 | consumed samples: 10059776 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805515E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:21.310276 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:36:25] iteration 9825/ 11920 | consumed samples: 10060800 | elapsed time per iteration (ms): 6011.4 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821193E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:29:53.819033 | finish at 2025-09-10 12:06:18 + [2025-09-10 08:36:30] iteration 9826/ 11920 | consumed samples: 10061824 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812389E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 3:16:33.424411 | finish at 2025-09-10 11:53:04 + [2025-09-10 08:36:36] iteration 9827/ 11920 | consumed samples: 10062848 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827849E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:06.116403 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:36:41] iteration 9828/ 11920 | consumed samples: 10063872 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826825E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:16:03.825549 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:36:47] iteration 9829/ 11920 | consumed samples: 10064896 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817662E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:43.943258 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:36:53] iteration 9830/ 11920 | consumed samples: 10065920 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817618E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:58.365760 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:36:58] iteration 9831/ 11920 | consumed samples: 10066944 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.820230E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:59.693113 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:37:04] iteration 9832/ 11920 | consumed samples: 10067968 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818909E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:36.291670 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:37:10] iteration 9833/ 11920 | consumed samples: 10068992 | elapsed time per iteration (ms): 5863.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826413E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:23:56.282842 | finish at 2025-09-10 12:01:06 + [2025-09-10 08:37:15] iteration 9834/ 11920 | consumed samples: 10070016 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811429E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:26.725056 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:37:21] iteration 9835/ 11920 | consumed samples: 10071040 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825403E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:13.152763 | finish at 2025-09-10 11:52:34 + [2025-09-10 08:37:27] iteration 
9836/ 11920 | consumed samples: 10072064 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830315E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:12.950268 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:37:32] iteration 9837/ 11920 | consumed samples: 10073088 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801210E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:21.711640 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:37:38] iteration 9838/ 11920 | consumed samples: 10074112 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820224E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:12.684063 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:37:44] iteration 9839/ 11920 | consumed samples: 10075136 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814617E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:07.745045 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:37:49] iteration 9840/ 11920 | consumed samples: 10076160 | elapsed time per iteration (ms): 5925.4 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820592E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 
3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:25:24.789391 | finish at 2025-09-10 12:03:14 + [2025-09-10 08:37:55] iteration 9841/ 11920 | consumed samples: 10077184 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812088E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:51.378661 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:38:01] iteration 9842/ 11920 | consumed samples: 10078208 | elapsed time per iteration (ms): 5826.9 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818121E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:48.284339 | finish at 2025-09-10 11:59:49 + [2025-09-10 08:38:07] iteration 9843/ 11920 | consumed samples: 10079232 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815702E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:25.525756 | finish at 2025-09-10 11:52:32 + [2025-09-10 08:38:12] iteration 9844/ 11920 | consumed samples: 10080256 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826580E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:34.871284 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:38:18] iteration 9845/ 11920 | consumed samples: 10081280 | elapsed time per iteration (ms): 5619.3 | throughput per 
GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808463E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:19.976524 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:38:23] iteration 9846/ 11920 | consumed samples: 10082304 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816307E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:21.114329 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:38:29] iteration 9847/ 11920 | consumed samples: 10083328 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820152E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:25.302504 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:38:35] iteration 9848/ 11920 | consumed samples: 10084352 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819159E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:10.732306 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:38:40] iteration 9849/ 11920 | consumed samples: 10085376 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813322E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:14:31.127812 
| finish at 2025-09-10 11:53:11 + [2025-09-10 08:38:46] iteration 9850/ 11920 | consumed samples: 10086400 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813520E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:52.690072 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:38:52] iteration 9851/ 11920 | consumed samples: 10087424 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827798E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:41.703934 | finish at 2025-09-10 11:52:33 + [2025-09-10 08:38:57] iteration 9852/ 11920 | consumed samples: 10088448 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809515E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:55.076677 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:39:03] iteration 9853/ 11920 | consumed samples: 10089472 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823025E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:32.272511 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:39:08] iteration 9854/ 11920 | consumed samples: 10090496 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 2.811034E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:26.340314 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:39:14] iteration 9855/ 11920 | consumed samples: 10091520 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799228E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:58.279766 | finish at 2025-09-10 11:53:12 + [2025-09-10 08:39:20] iteration 9856/ 11920 | consumed samples: 10092544 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825151E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:24.876297 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:39:25] iteration 9857/ 11920 | consumed samples: 10093568 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808664E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:10.752514 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:39:31] iteration 9858/ 11920 | consumed samples: 10094592 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825413E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:13.017721 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:39:37] iteration 9859/ 11920 | consumed samples: 
10095616 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822727E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:43.481516 | finish at 2025-09-10 11:53:20 + [2025-09-10 08:39:42] iteration 9860/ 11920 | consumed samples: 10096640 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807612E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:57.935987 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:39:48] iteration 9861/ 11920 | consumed samples: 10097664 | elapsed time per iteration (ms): 5838.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806225E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:22.278593 | finish at 2025-09-10 12:00:10 + [2025-09-10 08:39:54] iteration 9862/ 11920 | consumed samples: 10098688 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821418E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:44.676184 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:39:59] iteration 9863/ 11920 | consumed samples: 10099712 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810391E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:48.442600 | finish at 2025-09-10 11:52:48 + [2025-09-10 08:40:05] iteration 9864/ 11920 | consumed samples: 10100736 | elapsed time per iteration (ms): 5960.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811802E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:24:15.144861 | finish at 2025-09-10 12:04:20 + [2025-09-10 08:40:11] iteration 9865/ 11920 | consumed samples: 10101760 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815480E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:23.847989 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:40:16] iteration 9866/ 11920 | consumed samples: 10102784 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812696E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:21.808379 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:40:22] iteration 9867/ 11920 | consumed samples: 10103808 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808196E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:16.188703 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:40:28] iteration 9868/ 11920 | consumed samples: 10104832 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 
| MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811316E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:07.537728 | finish at 2025-09-10 11:52:35 + [2025-09-10 08:40:33] iteration 9869/ 11920 | consumed samples: 10105856 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820225E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:14.016326 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:40:39] iteration 9870/ 11920 | consumed samples: 10106880 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818350E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:27.563279 | finish at 2025-09-10 11:53:06 + [2025-09-10 08:40:45] iteration 9871/ 11920 | consumed samples: 10107904 | elapsed time per iteration (ms): 5933.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819754E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:37.326852 | finish at 2025-09-10 12:03:22 + [2025-09-10 08:40:51] iteration 9872/ 11920 | consumed samples: 10108928 | elapsed time per iteration (ms): 5914.6 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817770E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:53.073242 | finish at 2025-09-10 
12:02:44 + [2025-09-10 08:40:56] iteration 9873/ 11920 | consumed samples: 10109952 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817605E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:12:19.009450 | finish at 2025-09-10 11:53:15 + [2025-09-10 08:41:02] iteration 9874/ 11920 | consumed samples: 10110976 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827675E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:48.221708 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:41:08] iteration 9875/ 11920 | consumed samples: 10112000 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811436E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:34.118193 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:41:13] iteration 9876/ 11920 | consumed samples: 10113024 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809473E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:37.461014 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:41:19] iteration 9877/ 11920 | consumed samples: 10114048 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812247E+00 | loss 
scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:26.397195 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:41:25] iteration 9878/ 11920 | consumed samples: 10115072 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822438E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:27.718342 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:41:30] iteration 9879/ 11920 | consumed samples: 10116096 | elapsed time per iteration (ms): 5849.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821096E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:58.177805 | finish at 2025-09-10 12:00:29 + [2025-09-10 08:41:36] iteration 9880/ 11920 | consumed samples: 10117120 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833937E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:21.379280 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:41:42] iteration 9881/ 11920 | consumed samples: 10118144 | elapsed time per iteration (ms): 5839.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807778E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:26.236763 | finish at 2025-09-10 12:00:08 + [2025-09-10 08:41:47] iteration 9882/ 11920 | consumed samples: 10119168 | elapsed time 
per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814356E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:55.781288 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:41:53] iteration 9883/ 11920 | consumed samples: 10120192 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817175E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:11:04.384654 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:41:59] iteration 9884/ 11920 | consumed samples: 10121216 | elapsed time per iteration (ms): 5943.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817892E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:21:40.333583 | finish at 2025-09-10 12:03:39 + [2025-09-10 08:42:05] iteration 9885/ 11920 | consumed samples: 10122240 | elapsed time per iteration (ms): 5614.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813990E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:25.991319 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:42:10] iteration 9886/ 11920 | consumed samples: 10123264 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806776E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 3:10:20.567649 | finish at 2025-09-10 11:52:31 + [2025-09-10 08:42:16] iteration 9887/ 11920 | consumed samples: 10124288 | elapsed time per iteration (ms): 5615.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816667E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:16.359432 | finish at 2025-09-10 11:52:32 + [2025-09-10 08:42:21] iteration 9888/ 11920 | consumed samples: 10125312 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835546E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:21.313030 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:42:27] iteration 9889/ 11920 | consumed samples: 10126336 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807626E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:14.464787 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:42:33] iteration 9890/ 11920 | consumed samples: 10127360 | elapsed time per iteration (ms): 5866.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827763E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:29.145803 | finish at 2025-09-10 12:01:02 + [2025-09-10 08:42:39] iteration 9891/ 11920 | consumed samples: 10128384 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.800118E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:09.455746 | finish at 2025-09-10 11:52:48 + [2025-09-10 08:42:44] iteration 9892/ 11920 | consumed samples: 10129408 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819458E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:57.064342 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:42:50] iteration 9893/ 11920 | consumed samples: 10130432 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820731E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:54.241680 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:42:55] iteration 9894/ 11920 | consumed samples: 10131456 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818359E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:56.348539 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:43:01] iteration 9895/ 11920 | consumed samples: 10132480 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816939E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:43.471388 | finish at 2025-09-10 11:52:45 + [2025-09-10 
08:43:07] iteration 9896/ 11920 | consumed samples: 10133504 | elapsed time per iteration (ms): 5928.1 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815243E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:58.403685 | finish at 2025-09-10 12:03:05 + [2025-09-10 08:43:13] iteration 9897/ 11920 | consumed samples: 10134528 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818928E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:29.273273 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:43:18] iteration 9898/ 11920 | consumed samples: 10135552 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823523E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:34.342961 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:43:24] iteration 9899/ 11920 | consumed samples: 10136576 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828590E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:35.164740 | finish at 2025-09-10 11:52:59 + [2025-09-10 08:43:30] iteration 9900/ 11920 | consumed samples: 10137600 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813890E+00 | loss scale: 1.0 | grad norm: 
0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:18.179517 | finish at 2025-09-10 11:52:48 + [2025-09-10 08:43:35] iteration 9901/ 11920 | consumed samples: 10138624 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806019E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:12.225475 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:43:41] iteration 9902/ 11920 | consumed samples: 10139648 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821946E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:12.577915 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:43:47] iteration 9903/ 11920 | consumed samples: 10140672 | elapsed time per iteration (ms): 5872.5 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802220E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:24.830461 | finish at 2025-09-10 12:01:11 + [2025-09-10 08:43:53] iteration 9904/ 11920 | consumed samples: 10141696 | elapsed time per iteration (ms): 5954.3 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800156E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:20:03.946518 | finish at 2025-09-10 12:03:57 + [2025-09-10 08:43:58] iteration 9905/ 11920 | consumed samples: 10142720 | elapsed time per iteration (ms): 
5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815713E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:09:04.525656 | finish at 2025-09-10 11:53:03 + [2025-09-10 08:44:04] iteration 9906/ 11920 | consumed samples: 10143744 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820780E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:32.742886 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:44:09] iteration 9907/ 11920 | consumed samples: 10144768 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800931E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:39.937741 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:44:15] iteration 9908/ 11920 | consumed samples: 10145792 | elapsed time per iteration (ms): 5925.0 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810457E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:41.076399 | finish at 2025-09-10 12:02:56 + [2025-09-10 08:44:21] iteration 9909/ 11920 | consumed samples: 10146816 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813745E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 3:08:19.736517 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:44:27] iteration 9910/ 11920 | consumed samples: 10147840 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800594E+00 | loss scale: 1.0 | grad norm: 0.246 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:17.267954 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:44:32] iteration 9911/ 11920 | consumed samples: 10148864 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837581E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:22.917891 | finish at 2025-09-10 11:52:55 + [2025-09-10 08:44:38] iteration 9912/ 11920 | consumed samples: 10149888 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816064E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:23.628891 | finish at 2025-09-10 11:53:02 + [2025-09-10 08:44:44] iteration 9913/ 11920 | consumed samples: 10150912 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819916E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:54.674814 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:44:49] iteration 9914/ 11920 | consumed samples: 10151936 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.802447E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:09.014771 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:44:55] iteration 9915/ 11920 | consumed samples: 10152960 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817187E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:54.715695 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:45:00] iteration 9916/ 11920 | consumed samples: 10153984 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813237E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:49.262012 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:45:06] iteration 9917/ 11920 | consumed samples: 10155008 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820170E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:59.285155 | finish at 2025-09-10 11:53:05 + [2025-09-10 08:45:12] iteration 9918/ 11920 | consumed samples: 10156032 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817279E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:30.919493 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:45:17] iteration 
9919/ 11920 | consumed samples: 10157056 | elapsed time per iteration (ms): 5814.2 | throughput per GPU (TFLOP/s/GPU): 77.7 | MFU 7.85% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804147E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:54.160139 | finish at 2025-09-10 11:59:12 + [2025-09-10 08:45:23] iteration 9920/ 11920 | consumed samples: 10158080 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806370E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:21.968155 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:45:29] iteration 9921/ 11920 | consumed samples: 10159104 | elapsed time per iteration (ms): 6312.2 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816309E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:30:18.015496 | finish at 2025-09-10 12:15:47 + [2025-09-10 08:45:35] iteration 9922/ 11920 | consumed samples: 10160128 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814694E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:08.322473 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:45:41] iteration 9923/ 11920 | consumed samples: 10161152 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816317E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 
2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:22.384143 | finish at 2025-09-10 11:53:03 + [2025-09-10 08:45:47] iteration 9924/ 11920 | consumed samples: 10162176 | elapsed time per iteration (ms): 5974.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826930E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:18:45.129483 | finish at 2025-09-10 12:04:32 + [2025-09-10 08:45:52] iteration 9925/ 11920 | consumed samples: 10163200 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814992E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:59.049668 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:45:58] iteration 9926/ 11920 | consumed samples: 10164224 | elapsed time per iteration (ms): 5897.7 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812143E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:59.974483 | finish at 2025-09-10 12:01:58 + [2025-09-10 08:46:04] iteration 9927/ 11920 | consumed samples: 10165248 | elapsed time per iteration (ms): 5612.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815900E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:25.004879 | finish at 2025-09-10 11:52:29 + [2025-09-10 08:46:09] iteration 9928/ 11920 | consumed samples: 10166272 | elapsed time per iteration (ms): 5618.7 | throughput per 
GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817441E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:32.450455 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:46:15] iteration 9929/ 11920 | consumed samples: 10167296 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821726E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:28.122915 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:46:21] iteration 9930/ 11920 | consumed samples: 10168320 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819018E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:17.528453 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:46:26] iteration 9931/ 11920 | consumed samples: 10169344 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813947E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:26.982143 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:46:32] iteration 9932/ 11920 | consumed samples: 10170368 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819916E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:28.491532 
| finish at 2025-09-10 11:53:00 + [2025-09-10 08:46:38] iteration 9933/ 11920 | consumed samples: 10171392 | elapsed time per iteration (ms): 5978.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818480E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:17:59.058518 | finish at 2025-09-10 12:04:37 + [2025-09-10 08:46:43] iteration 9934/ 11920 | consumed samples: 10172416 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812137E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:09.940301 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:46:49] iteration 9935/ 11920 | consumed samples: 10173440 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807629E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:01.406353 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:46:55] iteration 9936/ 11920 | consumed samples: 10174464 | elapsed time per iteration (ms): 5858.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821798E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:42.792404 | finish at 2025-09-10 12:00:38 + [2025-09-10 08:47:01] iteration 9937/ 11920 | consumed samples: 10175488 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm 
loss: 2.815154E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:40.780095 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:47:06] iteration 9938/ 11920 | consumed samples: 10176512 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820600E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:30.778617 | finish at 2025-09-10 11:52:37 + [2025-09-10 08:47:12] iteration 9939/ 11920 | consumed samples: 10177536 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814738E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:29.188631 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:47:17] iteration 9940/ 11920 | consumed samples: 10178560 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804040E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:26.751938 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:47:23] iteration 9941/ 11920 | consumed samples: 10179584 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824478E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:22.907392 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:47:29] iteration 9942/ 11920 | consumed samples: 
10180608 | elapsed time per iteration (ms): 5867.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821458E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:13:24.915829 | finish at 2025-09-10 12:00:54 + [2025-09-10 08:47:35] iteration 9943/ 11920 | consumed samples: 10181632 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822073E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:25.508694 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:47:40] iteration 9944/ 11920 | consumed samples: 10182656 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812117E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:59.241199 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:47:46] iteration 9945/ 11920 | consumed samples: 10183680 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816326E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:00.659072 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:47:51] iteration 9946/ 11920 | consumed samples: 10184704 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801486E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:11.792267 | finish at 2025-09-10 11:53:03 + [2025-09-10 08:47:57] iteration 9947/ 11920 | consumed samples: 10185728 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811256E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:09.552894 | finish at 2025-09-10 11:53:07 + [2025-09-10 08:48:03] iteration 9948/ 11920 | consumed samples: 10186752 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814551E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:47.225729 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:48:08] iteration 9949/ 11920 | consumed samples: 10187776 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819430E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:34.122699 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:48:14] iteration 9950/ 11920 | consumed samples: 10188800 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815200E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:28.991702 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:48:20] iteration 9951/ 11920 | consumed samples: 10189824 | elapsed time per iteration (ms): 6327.5 | throughput per GPU (TFLOP/s/GPU): 71.4 
| MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812709E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:27:38.825643 | finish at 2025-09-10 12:15:59 + [2025-09-10 08:48:26] iteration 9952/ 11920 | consumed samples: 10190848 | elapsed time per iteration (ms): 6165.4 | throughput per GPU (TFLOP/s/GPU): 73.2 | MFU 7.40% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819911E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:22:13.469250 | finish at 2025-09-10 12:10:40 + [2025-09-10 08:48:32] iteration 9953/ 11920 | consumed samples: 10191872 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815311E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:11.420660 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:48:38] iteration 9954/ 11920 | consumed samples: 10192896 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815109E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:16.306974 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:48:43] iteration 9955/ 11920 | consumed samples: 10193920 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819488E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:02.368412 | finish at 2025-09-10 
11:52:46 + [2025-09-10 08:48:49] iteration 9956/ 11920 | consumed samples: 10194944 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807120E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:03.063293 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:48:55] iteration 9957/ 11920 | consumed samples: 10195968 | elapsed time per iteration (ms): 5966.6 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817634E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:12.521332 | finish at 2025-09-10 12:04:07 + [2025-09-10 08:49:00] iteration 9958/ 11920 | consumed samples: 10196992 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817076E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:43.214919 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:49:06] iteration 9959/ 11920 | consumed samples: 10198016 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823731E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:47.492957 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:49:12] iteration 9960/ 11920 | consumed samples: 10199040 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800981E+00 | loss 
scale: 1.0 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:28.286533 | finish at 2025-09-10 11:52:40 + [2025-09-10 08:49:17] iteration 9961/ 11920 | consumed samples: 10200064 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811983E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:28.721783 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:49:23] iteration 9962/ 11920 | consumed samples: 10201088 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828496E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:15.759553 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:49:29] iteration 9963/ 11920 | consumed samples: 10202112 | elapsed time per iteration (ms): 5842.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829553E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:10:33.405708 | finish at 2025-09-10 12:00:02 + [2025-09-10 08:49:34] iteration 9964/ 11920 | consumed samples: 10203136 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806642E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:13.306440 | finish at 2025-09-10 11:52:48 + [2025-09-10 08:49:40] iteration 9965/ 11920 | consumed samples: 10204160 | elapsed time 
per iteration (ms): 5614.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813098E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:55.870295 | finish at 2025-09-10 11:52:36 + [2025-09-10 08:49:46] iteration 9966/ 11920 | consumed samples: 10205184 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810778E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:03.246821 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:49:51] iteration 9967/ 11920 | consumed samples: 10206208 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810462E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:23.365094 | finish at 2025-09-10 11:53:15 + [2025-09-10 08:49:57] iteration 9968/ 11920 | consumed samples: 10207232 | elapsed time per iteration (ms): 5631.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809376E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:12.372009 | finish at 2025-09-10 11:53:09 + [2025-09-10 08:50:03] iteration 9969/ 11920 | consumed samples: 10208256 | elapsed time per iteration (ms): 5870.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825342E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 3:10:53.162865 | finish at 2025-09-10 12:00:56 + [2025-09-10 08:50:08] iteration 9970/ 11920 | consumed samples: 10209280 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823315E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:40.670686 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:50:14] iteration 9971/ 11920 | consumed samples: 10210304 | elapsed time per iteration (ms): 6011.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811611E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:15:15.471832 | finish at 2025-09-10 12:05:30 + [2025-09-10 08:50:20] iteration 9972/ 11920 | consumed samples: 10211328 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804717E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:57.588861 | finish at 2025-09-10 11:53:18 + [2025-09-10 08:50:26] iteration 9973/ 11920 | consumed samples: 10212352 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814153E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:23.506849 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:50:31] iteration 9974/ 11920 | consumed samples: 10213376 | elapsed time per iteration (ms): 5614.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 
2.000000E-03 | global batch size: 1024 | lm loss: 2.815339E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:06.569636 | finish at 2025-09-10 11:52:38 + [2025-09-10 08:50:37] iteration 9975/ 11920 | consumed samples: 10214400 | elapsed time per iteration (ms): 5615.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818271E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:02.213761 | finish at 2025-09-10 11:52:39 + [2025-09-10 08:50:42] iteration 9976/ 11920 | consumed samples: 10215424 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826390E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:09.541065 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:50:48] iteration 9977/ 11920 | consumed samples: 10216448 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818864E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:12.914209 | finish at 2025-09-10 11:53:01 + [2025-09-10 08:50:54] iteration 9978/ 11920 | consumed samples: 10217472 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815105E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:37.213504 | finish at 2025-09-10 11:53:31 + [2025-09-10 
08:50:59] iteration 9979/ 11920 | consumed samples: 10218496 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830608E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:01.164014 | finish at 2025-09-10 11:53:01 + [2025-09-10 08:51:05] iteration 9980/ 11920 | consumed samples: 10219520 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810143E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:56.260386 | finish at 2025-09-10 11:53:01 + [2025-09-10 08:51:11] iteration 9981/ 11920 | consumed samples: 10220544 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827879E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:38.783012 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:51:17] iteration 9982/ 11920 | consumed samples: 10221568 | elapsed time per iteration (ms): 6182.8 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820451E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:19:42.221224 | finish at 2025-09-10 12:10:59 + [2025-09-10 08:51:22] iteration 9983/ 11920 | consumed samples: 10222592 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821098E+00 | loss scale: 1.0 | grad norm: 
0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:21.806980 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:51:28] iteration 9984/ 11920 | consumed samples: 10223616 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820919E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:22.477196 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:51:34] iteration 9985/ 11920 | consumed samples: 10224640 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821918E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:16.029822 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:51:39] iteration 9986/ 11920 | consumed samples: 10225664 | elapsed time per iteration (ms): 5635.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822352E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:39.557668 | finish at 2025-09-10 11:53:19 + [2025-09-10 08:51:45] iteration 9987/ 11920 | consumed samples: 10226688 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824160E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:17.882951 | finish at 2025-09-10 11:53:03 + [2025-09-10 08:51:51] iteration 9988/ 11920 | consumed samples: 10227712 | elapsed time per iteration (ms): 
5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819804E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:09.723435 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:51:56] iteration 9989/ 11920 | consumed samples: 10228736 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820434E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:01:27.629469 | finish at 2025-09-10 11:53:24 + [2025-09-10 08:52:02] iteration 9990/ 11920 | consumed samples: 10229760 | elapsed time per iteration (ms): 5833.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819531E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:39.236450 | finish at 2025-09-10 11:59:41 + [2025-09-10 08:52:08] iteration 9991/ 11920 | consumed samples: 10230784 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820455E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:35.843511 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:52:13] iteration 9992/ 11920 | consumed samples: 10231808 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809412E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
remaining time: 3:00:27.367020 | finish at 2025-09-10 11:52:41 + [2025-09-10 08:52:19] iteration 9993/ 11920 | consumed samples: 10232832 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830673E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:25.757418 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:52:25] iteration 9994/ 11920 | consumed samples: 10233856 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810825E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:17.258041 | finish at 2025-09-10 11:52:42 + [2025-09-10 08:52:30] iteration 9995/ 11920 | consumed samples: 10234880 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819214E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:16.298628 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:52:36] iteration 9996/ 11920 | consumed samples: 10235904 | elapsed time per iteration (ms): 5831.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826164E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:59.499221 | finish at 2025-09-10 11:59:35 + [2025-09-10 08:52:42] iteration 9997/ 11920 | consumed samples: 10236928 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | 
global batch size: 1024 | lm loss: 2.813650E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:05.235137 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:52:47] iteration 9998/ 11920 | consumed samples: 10237952 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816039E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:11.365935 | finish at 2025-09-10 11:52:59 + [2025-09-10 08:52:53] iteration 9999/ 11920 | consumed samples: 10238976 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807820E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:25.645645 | finish at 2025-09-10 11:53:18 + [2025-09-10 08:52:58] iteration 10000/ 11920 | consumed samples: 10240000 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818342E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:00.523224 | finish at 2025-09-10 11:52:59 + [2025-09-10 08:53:04] iteration 10001/ 11920 | consumed samples: 10241024 | elapsed time per iteration (ms): 5637.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818907E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:18.239060 | finish at 2025-09-10 11:53:22 + [2025-09-10 08:53:10] iteration 
10002/ 11920 | consumed samples: 10242048 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826522E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:18.052026 | finish at 2025-09-10 11:53:28 + [2025-09-10 08:53:15] iteration 10003/ 11920 | consumed samples: 10243072 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826606E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:53.629801 | finish at 2025-09-10 11:53:09 + [2025-09-10 08:53:21] iteration 10004/ 11920 | consumed samples: 10244096 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812569E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:54.711230 | finish at 2025-09-10 11:53:16 + [2025-09-10 08:53:27] iteration 10005/ 11920 | consumed samples: 10245120 | elapsed time per iteration (ms): 5863.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816687E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:07.727935 | finish at 2025-09-10 12:00:35 + [2025-09-10 08:53:32] iteration 10006/ 11920 | consumed samples: 10246144 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820011E+00 | loss scale: 1.0 | grad norm: 0.173 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:22.456959 | finish at 2025-09-10 11:52:55 + [2025-09-10 08:53:38] iteration 10007/ 11920 | consumed samples: 10247168 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822108E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:08.616938 | finish at 2025-09-10 11:52:47 + [2025-09-10 08:53:44] iteration 10008/ 11920 | consumed samples: 10248192 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808784E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:13.659782 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:53:49] iteration 10009/ 11920 | consumed samples: 10249216 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805110E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:11.731911 | finish at 2025-09-10 11:53:01 + [2025-09-10 08:53:55] iteration 10010/ 11920 | consumed samples: 10250240 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807413E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:13.890390 | finish at 2025-09-10 11:53:09 + [2025-09-10 08:54:01] iteration 10011/ 11920 | consumed samples: 10251264 | elapsed time per iteration (ms): 5624.4 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808562E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:56.942088 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:54:06] iteration 10012/ 11920 | consumed samples: 10252288 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813044E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:42.614056 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:54:12] iteration 10013/ 11920 | consumed samples: 10253312 | elapsed time per iteration (ms): 5932.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815907E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:08:32.557548 | finish at 2025-09-10 12:02:45 + [2025-09-10 08:54:18] iteration 10014/ 11920 | consumed samples: 10254336 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817933E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:34.118696 | finish at 2025-09-10 11:52:52 + [2025-09-10 08:54:24] iteration 10015/ 11920 | consumed samples: 10255360 | elapsed time per iteration (ms): 5841.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816903E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 3:05:27.772354 | finish at 2025-09-10 11:59:51 + [2025-09-10 08:54:29] iteration 10016/ 11920 | consumed samples: 10256384 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818068E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:13.651936 | finish at 2025-09-10 11:52:43 + [2025-09-10 08:54:35] iteration 10017/ 11920 | consumed samples: 10257408 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819594E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:14.247273 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:54:41] iteration 10018/ 11920 | consumed samples: 10258432 | elapsed time per iteration (ms): 5823.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810226E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:36.157175 | finish at 2025-09-10 11:59:17 + [2025-09-10 08:54:46] iteration 10019/ 11920 | consumed samples: 10259456 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814752E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:07.717015 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:54:52] iteration 10020/ 11920 | consumed samples: 10260480 | elapsed time per iteration (ms): 5870.1 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.801365E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:05:53.129339 | finish at 2025-09-10 12:00:45 + [2025-09-10 08:54:58] iteration 10021/ 11920 | consumed samples: 10261504 | elapsed time per iteration (ms): 5907.0 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799497E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:57.358261 | finish at 2025-09-10 12:01:55 + [2025-09-10 08:55:04] iteration 10022/ 11920 | consumed samples: 10262528 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822631E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:41.422328 | finish at 2025-09-10 11:52:45 + [2025-09-10 08:55:09] iteration 10023/ 11920 | consumed samples: 10263552 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801944E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:56.363529 | finish at 2025-09-10 11:53:06 + [2025-09-10 08:55:15] iteration 10024/ 11920 | consumed samples: 10264576 | elapsed time per iteration (ms): 5936.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826178E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:35.295393 | finish at 2025-09-10 12:02:51 + [2025-09-10 08:55:21] iteration 
10025/ 11920 | consumed samples: 10265600 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821102E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:23.010236 | finish at 2025-09-10 11:52:44 + [2025-09-10 08:55:27] iteration 10026/ 11920 | consumed samples: 10266624 | elapsed time per iteration (ms): 5939.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806750E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:07:28.641001 | finish at 2025-09-10 12:02:55 + [2025-09-10 08:55:32] iteration 10027/ 11920 | consumed samples: 10267648 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806620E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:22.070908 | finish at 2025-09-10 11:52:55 + [2025-09-10 08:55:38] iteration 10028/ 11920 | consumed samples: 10268672 | elapsed time per iteration (ms): 5645.6 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820602E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:58:01.523168 | finish at 2025-09-10 11:53:40 + [2025-09-10 08:55:44] iteration 10029/ 11920 | consumed samples: 10269696 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823666E+00 | loss scale: 1.0 | grad norm: 0.157 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:14.307863 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:55:49] iteration 10030/ 11920 | consumed samples: 10270720 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823679E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:03.166037 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:55:55] iteration 10031/ 11920 | consumed samples: 10271744 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805818E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:16.644269 | finish at 2025-09-10 11:53:12 + [2025-09-10 08:56:01] iteration 10032/ 11920 | consumed samples: 10272768 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808351E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:56.931885 | finish at 2025-09-10 11:52:58 + [2025-09-10 08:56:06] iteration 10033/ 11920 | consumed samples: 10273792 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821965E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:48.228073 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:56:12] iteration 10034/ 11920 | consumed samples: 10274816 | elapsed time per iteration (ms): 5849.4 | 
throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815587E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:52.016146 | finish at 2025-09-10 12:00:04 + [2025-09-10 08:56:18] iteration 10035/ 11920 | consumed samples: 10275840 | elapsed time per iteration (ms): 5944.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812073E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:45.539739 | finish at 2025-09-10 12:03:04 + [2025-09-10 08:56:24] iteration 10036/ 11920 | consumed samples: 10276864 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811509E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:32.403148 | finish at 2025-09-10 11:52:56 + [2025-09-10 08:56:30] iteration 10037/ 11920 | consumed samples: 10277888 | elapsed time per iteration (ms): 5955.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814757E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:06:54.230082 | finish at 2025-09-10 12:03:24 + [2025-09-10 08:56:35] iteration 10038/ 11920 | consumed samples: 10278912 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817173E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:56:14.317621 | finish at 2025-09-10 11:52:50 + [2025-09-10 08:56:41] iteration 10039/ 11920 | consumed samples: 10279936 | elapsed time per iteration (ms): 5863.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807435E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:49.754076 | finish at 2025-09-10 12:00:31 + [2025-09-10 08:56:47] iteration 10040/ 11920 | consumed samples: 10280960 | elapsed time per iteration (ms): 5831.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807512E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:42.749014 | finish at 2025-09-10 11:59:30 + [2025-09-10 08:56:53] iteration 10041/ 11920 | consumed samples: 10281984 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812954E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:17.248398 | finish at 2025-09-10 11:53:10 + [2025-09-10 08:56:58] iteration 10042/ 11920 | consumed samples: 10283008 | elapsed time per iteration (ms): 5889.5 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814614E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:20.495145 | finish at 2025-09-10 12:01:19 + [2025-09-10 08:57:04] iteration 10043/ 11920 | consumed samples: 10284032 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.799355E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:51.128603 | finish at 2025-09-10 11:52:55 + [2025-09-10 08:57:10] iteration 10044/ 11920 | consumed samples: 10285056 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820939E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:57.817184 | finish at 2025-09-10 11:53:07 + [2025-09-10 08:57:15] iteration 10045/ 11920 | consumed samples: 10286080 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812320E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:39.048314 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:57:21] iteration 10046/ 11920 | consumed samples: 10287104 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812297E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:33.385043 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:57:27] iteration 10047/ 11920 | consumed samples: 10288128 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807322E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:27.462366 | finish at 2025-09-10 11:52:54 + [2025-09-10 08:57:32] iteration 
10048/ 11920 | consumed samples: 10289152 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807919E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:23.913094 | finish at 2025-09-10 11:52:56 + [2025-09-10 08:57:38] iteration 10049/ 11920 | consumed samples: 10290176 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824022E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:13.226094 | finish at 2025-09-10 11:52:51 + [2025-09-10 08:57:43] iteration 10050/ 11920 | consumed samples: 10291200 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808724E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:09.483604 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:57:49] iteration 10051/ 11920 | consumed samples: 10292224 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795943E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:03.893861 | finish at 2025-09-10 11:52:53 + [2025-09-10 08:57:55] iteration 10052/ 11920 | consumed samples: 10293248 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811719E+00 | loss scale: 1.0 | grad norm: 0.213 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:05.750157 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:58:00] iteration 10053/ 11920 | consumed samples: 10294272 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813998E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:54:46.117932 | finish at 2025-09-10 11:52:46 + [2025-09-10 08:58:06] iteration 10054/ 11920 | consumed samples: 10295296 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809297E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:54:49.702568 | finish at 2025-09-10 11:52:56 + [2025-09-10 08:58:12] iteration 10055/ 11920 | consumed samples: 10296320 | elapsed time per iteration (ms): 5917.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802905E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:56.778722 | finish at 2025-09-10 12:02:09 + [2025-09-10 08:58:17] iteration 10056/ 11920 | consumed samples: 10297344 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810598E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:54:39.739048 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:58:23] iteration 10057/ 11920 | consumed samples: 10298368 | elapsed time per iteration (ms): 5624.8 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826185E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:54:38.968581 | finish at 2025-09-10 11:53:02 + [2025-09-10 08:58:29] iteration 10058/ 11920 | consumed samples: 10299392 | elapsed time per iteration (ms): 5954.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818014E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:47.547062 | finish at 2025-09-10 12:03:17 + [2025-09-10 08:58:35] iteration 10059/ 11920 | consumed samples: 10300416 | elapsed time per iteration (ms): 5834.4 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818925E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:57.862655 | finish at 2025-09-10 11:59:33 + [2025-09-10 08:58:40] iteration 10060/ 11920 | consumed samples: 10301440 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819906E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:54:14.092026 | finish at 2025-09-10 11:52:55 + [2025-09-10 08:58:46] iteration 10061/ 11920 | consumed samples: 10302464 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821265E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:54:17.425923 | finish at 2025-09-10 11:53:04 + [2025-09-10 08:58:52] iteration 10062/ 11920 | consumed samples: 10303488 | elapsed time per iteration (ms): 5974.0 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813254E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:04:59.741571 | finish at 2025-09-10 12:03:52 + [2025-09-10 08:58:58] iteration 10063/ 11920 | consumed samples: 10304512 | elapsed time per iteration (ms): 5839.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809977E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:43.435612 | finish at 2025-09-10 11:59:41 + [2025-09-10 08:59:04] iteration 10064/ 11920 | consumed samples: 10305536 | elapsed time per iteration (ms): 5821.1 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840507E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:03.955841 | finish at 2025-09-10 11:59:08 + [2025-09-10 08:59:09] iteration 10065/ 11920 | consumed samples: 10306560 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828126E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:47.622918 | finish at 2025-09-10 11:52:57 + [2025-09-10 08:59:15] iteration 10066/ 11920 | consumed samples: 10307584 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.819112E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:44.602893 | finish at 2025-09-10 11:53:00 + [2025-09-10 08:59:21] iteration 10067/ 11920 | consumed samples: 10308608 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822280E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:41.803607 | finish at 2025-09-10 11:53:02 + [2025-09-10 08:59:26] iteration 10068/ 11920 | consumed samples: 10309632 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807938E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:46.017523 | finish at 2025-09-10 11:53:12 + [2025-09-10 08:59:32] iteration 10069/ 11920 | consumed samples: 10310656 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812490E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:17.063219 | finish at 2025-09-10 11:52:49 + [2025-09-10 08:59:38] iteration 10070/ 11920 | consumed samples: 10311680 | elapsed time per iteration (ms): 5847.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815584E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:17.032266 | finish at 2025-09-10 11:59:55 + [2025-09-10 08:59:43] iteration 
10071/ 11920 | consumed samples: 10312704 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815882E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:15.619749 | finish at 2025-09-10 11:52:59 + [2025-09-10 08:59:49] iteration 10072/ 11920 | consumed samples: 10313728 | elapsed time per iteration (ms): 5953.2 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809466E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:21.586927 | finish at 2025-09-10 12:03:11 + [2025-09-10 08:59:55] iteration 10073/ 11920 | consumed samples: 10314752 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809508E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:08.613619 | finish at 2025-09-10 11:53:03 + [2025-09-10 09:00:01] iteration 10074/ 11920 | consumed samples: 10315776 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818583E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:00.731212 | finish at 2025-09-10 11:53:01 + [2025-09-10 09:00:06] iteration 10075/ 11920 | consumed samples: 10316800 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810298E+00 | loss scale: 1.0 | grad norm: 0.203 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:04.891269 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:00:12] iteration 10076/ 11920 | consumed samples: 10317824 | elapsed time per iteration (ms): 5640.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813541E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:53:20.504435 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:00:17] iteration 10077/ 11920 | consumed samples: 10318848 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817349E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:43.631309 | finish at 2025-09-10 11:53:01 + [2025-09-10 09:00:23] iteration 10078/ 11920 | consumed samples: 10319872 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809607E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:34.042391 | finish at 2025-09-10 11:52:57 + [2025-09-10 09:00:29] iteration 10079/ 11920 | consumed samples: 10320896 | elapsed time per iteration (ms): 5957.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814258E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:48.260916 | finish at 2025-09-10 12:03:17 + [2025-09-10 09:00:35] iteration 10080/ 11920 | consumed samples: 10321920 | elapsed time per iteration (ms): 5620.0 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804339E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:20.709419 | finish at 2025-09-10 11:52:55 + [2025-09-10 09:00:40] iteration 10081/ 11920 | consumed samples: 10322944 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817654E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:16.810392 | finish at 2025-09-10 11:52:57 + [2025-09-10 09:00:46] iteration 10082/ 11920 | consumed samples: 10323968 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801499E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:18.781113 | finish at 2025-09-10 11:53:05 + [2025-09-10 09:00:52] iteration 10083/ 11920 | consumed samples: 10324992 | elapsed time per iteration (ms): 6004.0 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795759E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:49.414916 | finish at 2025-09-10 12:04:41 + [2025-09-10 09:00:57] iteration 10084/ 11920 | consumed samples: 10326016 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801078E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:52:15.492631 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:01:03] iteration 10085/ 11920 | consumed samples: 10327040 | elapsed time per iteration (ms): 5879.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802695E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:48.450528 | finish at 2025-09-10 12:00:52 + [2025-09-10 09:01:09] iteration 10086/ 11920 | consumed samples: 10328064 | elapsed time per iteration (ms): 5858.5 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806365E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:04.472391 | finish at 2025-09-10 12:00:14 + [2025-09-10 09:01:15] iteration 10087/ 11920 | consumed samples: 10329088 | elapsed time per iteration (ms): 5992.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806000E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:03:03.733004 | finish at 2025-09-10 12:04:19 + [2025-09-10 09:01:21] iteration 10088/ 11920 | consumed samples: 10330112 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796027E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:51.075212 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:01:27] iteration 10089/ 11920 | consumed samples: 10331136 | elapsed time per iteration (ms): 5919.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.811087E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:00:37.695384 | finish at 2025-09-10 12:02:04 + [2025-09-10 09:01:32] iteration 10090/ 11920 | consumed samples: 10332160 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808137E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:27.026961 | finish at 2025-09-10 11:52:59 + [2025-09-10 09:01:38] iteration 10091/ 11920 | consumed samples: 10333184 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804681E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:14.525360 | finish at 2025-09-10 11:52:53 + [2025-09-10 09:01:44] iteration 10092/ 11920 | consumed samples: 10334208 | elapsed time per iteration (ms): 5981.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801143E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 3:02:14.605462 | finish at 2025-09-10 12:03:59 + [2025-09-10 09:01:50] iteration 10093/ 11920 | consumed samples: 10335232 | elapsed time per iteration (ms): 5883.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806541E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:59:09.786288 | finish at 2025-09-10 12:01:00 + [2025-09-10 09:01:55] iteration 
10094/ 11920 | consumed samples: 10336256 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815402E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:11.027100 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:02:01] iteration 10095/ 11920 | consumed samples: 10337280 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807149E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:59.961122 | finish at 2025-09-10 11:53:01 + [2025-09-10 09:02:07] iteration 10096/ 11920 | consumed samples: 10338304 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806373E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:03.269829 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:02:12] iteration 10097/ 11920 | consumed samples: 10339328 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815706E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:57.215353 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:02:18] iteration 10098/ 11920 | consumed samples: 10340352 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827646E+00 | loss scale: 1.0 | grad norm: 0.239 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:40.441257 | finish at 2025-09-10 11:52:58 + [2025-09-10 09:02:24] iteration 10099/ 11920 | consumed samples: 10341376 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824688E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:31.909339 | finish at 2025-09-10 11:52:56 + [2025-09-10 09:02:29] iteration 10100/ 11920 | consumed samples: 10342400 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811883E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:28.346853 | finish at 2025-09-10 11:52:58 + [2025-09-10 09:02:35] iteration 10101/ 11920 | consumed samples: 10343424 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813442E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:21.602775 | finish at 2025-09-10 11:52:56 + [2025-09-10 09:02:41] iteration 10102/ 11920 | consumed samples: 10344448 | elapsed time per iteration (ms): 5845.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798898E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:07.842839 | finish at 2025-09-10 11:59:49 + [2025-09-10 09:02:46] iteration 10103/ 11920 | consumed samples: 10345472 | elapsed time per iteration (ms): 5627.8 | 
throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825972E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:25.784923 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:02:52] iteration 10104/ 11920 | consumed samples: 10346496 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814908E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:10.289307 | finish at 2025-09-10 11:53:02 + [2025-09-10 09:02:58] iteration 10105/ 11920 | consumed samples: 10347520 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816802E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:08.027480 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:03:03] iteration 10106/ 11920 | consumed samples: 10348544 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818399E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:03.778112 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:03:09] iteration 10107/ 11920 | consumed samples: 10349568 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813475E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:50:00.310903 | finish at 2025-09-10 11:53:09 + [2025-09-10 09:03:14] iteration 10108/ 11920 | consumed samples: 10350592 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804082E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:49.308709 | finish at 2025-09-10 11:53:04 + [2025-09-10 09:03:20] iteration 10109/ 11920 | consumed samples: 10351616 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801505E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:41.700596 | finish at 2025-09-10 11:53:02 + [2025-09-10 09:03:26] iteration 10110/ 11920 | consumed samples: 10352640 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800085E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:24.656515 | finish at 2025-09-10 11:52:50 + [2025-09-10 09:03:31] iteration 10111/ 11920 | consumed samples: 10353664 | elapsed time per iteration (ms): 5615.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807152E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:17.934832 | finish at 2025-09-10 11:52:49 + [2025-09-10 09:03:37] iteration 10112/ 11920 | consumed samples: 10354688 | elapsed time per iteration (ms): 5839.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.795003E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:58.418102 | finish at 2025-09-10 11:59:36 + [2025-09-10 09:03:43] iteration 10113/ 11920 | consumed samples: 10355712 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816677E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:15.280771 | finish at 2025-09-10 11:52:58 + [2025-09-10 09:03:48] iteration 10114/ 11920 | consumed samples: 10356736 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815554E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:26.065191 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:03:54] iteration 10115/ 11920 | consumed samples: 10357760 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795133E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:11.983293 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:04:00] iteration 10116/ 11920 | consumed samples: 10358784 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807066E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:59.617857 | finish at 2025-09-10 11:52:59 + [2025-09-10 09:04:05] iteration 
10117/ 11920 | consumed samples: 10359808 | elapsed time per iteration (ms): 5615.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812146E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:44.844032 | finish at 2025-09-10 11:52:50 + [2025-09-10 09:04:11] iteration 10118/ 11920 | consumed samples: 10360832 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810694E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:57.147068 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:04:16] iteration 10119/ 11920 | consumed samples: 10361856 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799620E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:50.687262 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:04:22] iteration 10120/ 11920 | consumed samples: 10362880 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800654E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:42.477865 | finish at 2025-09-10 11:53:05 + [2025-09-10 09:04:28] iteration 10121/ 11920 | consumed samples: 10363904 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812719E+00 | loss scale: 1.0 | grad norm: 0.188 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:30.955398 | finish at 2025-09-10 11:52:59 + [2025-09-10 09:04:33] iteration 10122/ 11920 | consumed samples: 10364928 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814140E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:26.524227 | finish at 2025-09-10 11:53:00 + [2025-09-10 09:04:39] iteration 10123/ 11920 | consumed samples: 10365952 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806599E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:21.199725 | finish at 2025-09-10 11:53:00 + [2025-09-10 09:04:45] iteration 10124/ 11920 | consumed samples: 10366976 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807120E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:20.985457 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:04:50] iteration 10125/ 11920 | consumed samples: 10368000 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808286E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:33.598018 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:04:56] iteration 10126/ 11920 | consumed samples: 10369024 | elapsed time per iteration (ms): 5630.8 | 
throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826871E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:21.581220 | finish at 2025-09-10 11:53:17 + [2025-09-10 09:05:01] iteration 10127/ 11920 | consumed samples: 10370048 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806996E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:07.092554 | finish at 2025-09-10 11:53:09 + [2025-09-10 09:05:07] iteration 10128/ 11920 | consumed samples: 10371072 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815421E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:51.797302 | finish at 2025-09-10 11:52:59 + [2025-09-10 09:05:13] iteration 10129/ 11920 | consumed samples: 10372096 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810951E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:56.927225 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:05:18] iteration 10130/ 11920 | consumed samples: 10373120 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818744E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:47:47.961757 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:05:24] iteration 10131/ 11920 | consumed samples: 10374144 | elapsed time per iteration (ms): 5953.8 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804698E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:31.387375 | finish at 2025-09-10 12:02:56 + [2025-09-10 09:05:30] iteration 10132/ 11920 | consumed samples: 10375168 | elapsed time per iteration (ms): 5899.6 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794900E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:48.430286 | finish at 2025-09-10 12:01:19 + [2025-09-10 09:05:36] iteration 10133/ 11920 | consumed samples: 10376192 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810380E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:18.816019 | finish at 2025-09-10 11:52:55 + [2025-09-10 09:05:41] iteration 10134/ 11920 | consumed samples: 10377216 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802586E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:19.628568 | finish at 2025-09-10 11:53:01 + [2025-09-10 09:05:47] iteration 10135/ 11920 | consumed samples: 10378240 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.815113E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:15.532969 | finish at 2025-09-10 11:53:03 + [2025-09-10 09:05:53] iteration 10136/ 11920 | consumed samples: 10379264 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816487E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:28.236967 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:05:58] iteration 10137/ 11920 | consumed samples: 10380288 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797521E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:17.498242 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:06:04] iteration 10138/ 11920 | consumed samples: 10381312 | elapsed time per iteration (ms): 5928.9 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829773E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:05.376706 | finish at 2025-09-10 12:02:10 + [2025-09-10 09:06:10] iteration 10139/ 11920 | consumed samples: 10382336 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810174E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:02.775901 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:06:16] iteration 
10140/ 11920 | consumed samples: 10383360 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818116E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:46.309919 | finish at 2025-09-10 11:53:02 + [2025-09-10 09:06:21] iteration 10141/ 11920 | consumed samples: 10384384 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825851E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:33.497415 | finish at 2025-09-10 11:52:55 + [2025-09-10 09:06:27] iteration 10142/ 11920 | consumed samples: 10385408 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822897E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:32.592098 | finish at 2025-09-10 11:52:59 + [2025-09-10 09:06:32] iteration 10143/ 11920 | consumed samples: 10386432 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799521E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:30.809569 | finish at 2025-09-10 11:53:03 + [2025-09-10 09:06:38] iteration 10144/ 11920 | consumed samples: 10387456 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800349E+00 | loss scale: 1.0 | grad norm: 0.226 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:21.980209 | finish at 2025-09-10 11:53:00 + [2025-09-10 09:06:44] iteration 10145/ 11920 | consumed samples: 10388480 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818945E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:22.755017 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:06:49] iteration 10146/ 11920 | consumed samples: 10389504 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802223E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:27.737361 | finish at 2025-09-10 11:53:17 + [2025-09-10 09:06:55] iteration 10147/ 11920 | consumed samples: 10390528 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822654E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:20.155615 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:07:01] iteration 10148/ 11920 | consumed samples: 10391552 | elapsed time per iteration (ms): 5945.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807799E+00 | loss scale: 1.0 | grad norm: 0.266 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:55:35.543689 | finish at 2025-09-10 12:02:36 + [2025-09-10 09:07:06] iteration 10149/ 11920 | consumed samples: 10392576 | elapsed time per iteration (ms): 5627.9 | 
throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832876E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:06.992540 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:07:12] iteration 10150/ 11920 | consumed samples: 10393600 | elapsed time per iteration (ms): 5861.8 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821930E+00 | loss scale: 1.0 | grad norm: 0.262 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:55.302365 | finish at 2025-09-10 12:00:08 + [2025-09-10 09:07:18] iteration 10151/ 11920 | consumed samples: 10394624 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821201E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:46.112563 | finish at 2025-09-10 11:53:04 + [2025-09-10 09:07:24] iteration 10152/ 11920 | consumed samples: 10395648 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818752E+00 | loss scale: 1.0 | grad norm: 0.264 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:46.665020 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:07:29] iteration 10153/ 11920 | consumed samples: 10396672 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813523E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:45:33.232656 | finish at 2025-09-10 11:53:02 + [2025-09-10 09:07:35] iteration 10154/ 11920 | consumed samples: 10397696 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817203E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:18.631458 | finish at 2025-09-10 11:52:53 + [2025-09-10 09:07:41] iteration 10155/ 11920 | consumed samples: 10398720 | elapsed time per iteration (ms): 5847.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817747E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:01.542506 | finish at 2025-09-10 11:59:42 + [2025-09-10 09:07:47] iteration 10156/ 11920 | consumed samples: 10399744 | elapsed time per iteration (ms): 6002.0 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813337E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:27.542885 | finish at 2025-09-10 12:04:14 + [2025-09-10 09:07:52] iteration 10157/ 11920 | consumed samples: 10400768 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809769E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:21.415846 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:07:58] iteration 10158/ 11920 | consumed samples: 10401792 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.792150E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:30.436932 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:08:04] iteration 10159/ 11920 | consumed samples: 10402816 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818125E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:59.685310 | finish at 2025-09-10 11:53:03 + [2025-09-10 09:08:09] iteration 10160/ 11920 | consumed samples: 10403840 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798602E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:04.495354 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:08:15] iteration 10161/ 11920 | consumed samples: 10404864 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813553E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:53.016634 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:08:20] iteration 10162/ 11920 | consumed samples: 10405888 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807055E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:49.154051 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:08:26] iteration 
10163/ 11920 | consumed samples: 10406912 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812970E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:01.035550 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:08:32] iteration 10164/ 11920 | consumed samples: 10407936 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804773E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:50.996859 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:08:37] iteration 10165/ 11920 | consumed samples: 10408960 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813950E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:38.168943 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:08:43] iteration 10166/ 11920 | consumed samples: 10409984 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804647E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:34.164592 | finish at 2025-09-10 11:53:17 + [2025-09-10 09:08:49] iteration 10167/ 11920 | consumed samples: 10411008 | elapsed time per iteration (ms): 5876.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803391E+00 | loss scale: 1.0 | grad norm: 0.191 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:41.824039 | finish at 2025-09-10 12:00:31 + [2025-09-10 09:08:54] iteration 10168/ 11920 | consumed samples: 10412032 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818954E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:12.087313 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:09:00] iteration 10169/ 11920 | consumed samples: 10413056 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807638E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:16.081671 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:09:06] iteration 10170/ 11920 | consumed samples: 10414080 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811351E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:09.518239 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:09:11] iteration 10171/ 11920 | consumed samples: 10415104 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804306E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:26.361755 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:09:17] iteration 10172/ 11920 | consumed samples: 10416128 | elapsed time per iteration (ms): 5933.7 | 
throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807024E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:52.069823 | finish at 2025-09-10 12:02:09 + [2025-09-10 09:09:23] iteration 10173/ 11920 | consumed samples: 10417152 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818649E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:32.061147 | finish at 2025-09-10 11:52:55 + [2025-09-10 09:09:29] iteration 10174/ 11920 | consumed samples: 10418176 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816225E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:33.298242 | finish at 2025-09-10 11:53:02 + [2025-09-10 09:09:34] iteration 10175/ 11920 | consumed samples: 10419200 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805028E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:41.252363 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:09:40] iteration 10176/ 11920 | consumed samples: 10420224 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806009E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:43:33.464046 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:09:45] iteration 10177/ 11920 | consumed samples: 10421248 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824194E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:21.589062 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:09:51] iteration 10178/ 11920 | consumed samples: 10422272 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807356E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:16.111439 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:09:57] iteration 10179/ 11920 | consumed samples: 10423296 | elapsed time per iteration (ms): 5891.7 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824723E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:57.372545 | finish at 2025-09-10 12:00:54 + [2025-09-10 09:10:03] iteration 10180/ 11920 | consumed samples: 10424320 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802656E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:17.472539 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:10:08] iteration 10181/ 11920 | consumed samples: 10425344 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.816994E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:12.335193 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:10:14] iteration 10182/ 11920 | consumed samples: 10426368 | elapsed time per iteration (ms): 5856.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802752E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:37.943380 | finish at 2025-09-10 11:59:52 + [2025-09-10 09:10:20] iteration 10183/ 11920 | consumed samples: 10427392 | elapsed time per iteration (ms): 5952.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815097E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:52:18.573721 | finish at 2025-09-10 12:02:39 + [2025-09-10 09:10:26] iteration 10184/ 11920 | consumed samples: 10428416 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805140E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:41.970705 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:10:31] iteration 10185/ 11920 | consumed samples: 10429440 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811345E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:26.079675 | finish at 2025-09-10 11:52:57 + [2025-09-10 09:10:37] iteration 
10186/ 11920 | consumed samples: 10430464 | elapsed time per iteration (ms): 5948.6 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818779E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:51:54.841900 | finish at 2025-09-10 12:02:32 + [2025-09-10 09:10:43] iteration 10187/ 11920 | consumed samples: 10431488 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804100E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:24.783205 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:10:48] iteration 10188/ 11920 | consumed samples: 10432512 | elapsed time per iteration (ms): 5646.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811756E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:43:00.148655 | finish at 2025-09-10 11:53:49 + [2025-09-10 09:10:54] iteration 10189/ 11920 | consumed samples: 10433536 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803479E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:25.762550 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:11:00] iteration 10190/ 11920 | consumed samples: 10434560 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814765E+00 | loss scale: 1.0 | grad norm: 0.220 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:22.294552 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:11:05] iteration 10191/ 11920 | consumed samples: 10435584 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824000E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:08.511817 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:11:11] iteration 10192/ 11920 | consumed samples: 10436608 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807268E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:54.364838 | finish at 2025-09-10 11:53:05 + [2025-09-10 09:11:17] iteration 10193/ 11920 | consumed samples: 10437632 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790151E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:50.815019 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:11:22] iteration 10194/ 11920 | consumed samples: 10438656 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809923E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:43.367855 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:11:28] iteration 10195/ 11920 | consumed samples: 10439680 | elapsed time per iteration (ms): 5622.7 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812288E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:39.182135 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:11:34] iteration 10196/ 11920 | consumed samples: 10440704 | elapsed time per iteration (ms): 5843.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808434E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:54.852302 | finish at 2025-09-10 11:59:29 + [2025-09-10 09:11:39] iteration 10197/ 11920 | consumed samples: 10441728 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809892E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:21.817912 | finish at 2025-09-10 11:53:01 + [2025-09-10 09:11:45] iteration 10198/ 11920 | consumed samples: 10442752 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808267E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:25.786071 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:11:51] iteration 10199/ 11920 | consumed samples: 10443776 | elapsed time per iteration (ms): 6216.4 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810335E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:58:18.393468 | finish at 2025-09-10 12:10:10 + [2025-09-10 09:11:57] iteration 10200/ 11920 | consumed samples: 10444800 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819356E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:14.530458 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:12:03] iteration 10201/ 11920 | consumed samples: 10445824 | elapsed time per iteration (ms): 6190.4 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815514E+00 | loss scale: 1.0 | grad norm: 0.268 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:57:21.246992 | finish at 2025-09-10 12:09:24 + [2025-09-10 09:12:09] iteration 10202/ 11920 | consumed samples: 10446848 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815006E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:15.993447 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:12:14] iteration 10203/ 11920 | consumed samples: 10447872 | elapsed time per iteration (ms): 5917.7 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819133E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:20.687715 | finish at 2025-09-10 12:01:35 + [2025-09-10 09:12:20] iteration 10204/ 11920 | consumed samples: 10448896 | elapsed time per iteration (ms): 5923.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.815827E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:49:25.464578 | finish at 2025-09-10 12:01:46 + [2025-09-10 09:12:26] iteration 10205/ 11920 | consumed samples: 10449920 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803824E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:53.442557 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:12:32] iteration 10206/ 11920 | consumed samples: 10450944 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808144E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:42.770585 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:12:37] iteration 10207/ 11920 | consumed samples: 10451968 | elapsed time per iteration (ms): 5616.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813130E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:21.663469 | finish at 2025-09-10 11:52:59 + [2025-09-10 09:12:43] iteration 10208/ 11920 | consumed samples: 10452992 | elapsed time per iteration (ms): 5846.3 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812670E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:48.831528 | finish at 2025-09-10 11:59:32 + [2025-09-10 09:12:49] iteration 
10209/ 11920 | consumed samples: 10454016 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812353E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:41.987151 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:12:54] iteration 10210/ 11920 | consumed samples: 10455040 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807988E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:31.174121 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:13:00] iteration 10211/ 11920 | consumed samples: 10456064 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797619E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:22.055243 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:13:06] iteration 10212/ 11920 | consumed samples: 10457088 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802208E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:06.011208 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:13:11] iteration 10213/ 11920 | consumed samples: 10458112 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813243E+00 | loss scale: 1.0 | grad norm: 0.193 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:03.492341 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:13:17] iteration 10214/ 11920 | consumed samples: 10459136 | elapsed time per iteration (ms): 6206.2 | throughput per GPU (TFLOP/s/GPU): 72.7 | MFU 7.36% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808724E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:56:27.737549 | finish at 2025-09-10 12:09:45 + [2025-09-10 09:13:23] iteration 10215/ 11920 | consumed samples: 10460160 | elapsed time per iteration (ms): 5843.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827736E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:02.565296 | finish at 2025-09-10 11:59:26 + [2025-09-10 09:13:29] iteration 10216/ 11920 | consumed samples: 10461184 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802670E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:43.363157 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:13:35] iteration 10217/ 11920 | consumed samples: 10462208 | elapsed time per iteration (ms): 5827.7 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798845E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:45:24.530702 | finish at 2025-09-10 11:58:59 + [2025-09-10 09:13:40] iteration 10218/ 11920 | consumed samples: 10463232 | elapsed time per iteration (ms): 5620.2 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806025E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:25.535625 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:13:46] iteration 10219/ 11920 | consumed samples: 10464256 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801608E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:26.053045 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:13:52] iteration 10220/ 11920 | consumed samples: 10465280 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792357E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:29.911528 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:13:57] iteration 10221/ 11920 | consumed samples: 10466304 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794865E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:22.172547 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:14:03] iteration 10222/ 11920 | consumed samples: 10467328 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809044E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:38:59.044232 | finish at 2025-09-10 11:53:02 + [2025-09-10 09:14:09] iteration 10223/ 11920 | consumed samples: 10468352 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802666E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:05.804236 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:14:14] iteration 10224/ 11920 | consumed samples: 10469376 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810235E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:05.066200 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:14:20] iteration 10225/ 11920 | consumed samples: 10470400 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807367E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:48.387566 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:14:25] iteration 10226/ 11920 | consumed samples: 10471424 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816695E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:49.202347 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:14:31] iteration 10227/ 11920 | consumed samples: 10472448 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.811589E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:44.416656 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:14:37] iteration 10228/ 11920 | consumed samples: 10473472 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815267E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:34.153762 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:14:42] iteration 10229/ 11920 | consumed samples: 10474496 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817217E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:32.742613 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:14:48] iteration 10230/ 11920 | consumed samples: 10475520 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805147E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:24.810340 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:14:54] iteration 10231/ 11920 | consumed samples: 10476544 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821097E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:27.415544 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:14:59] iteration 
10232/ 11920 | consumed samples: 10477568 | elapsed time per iteration (ms): 5636.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815410E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:33.532446 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:15:05] iteration 10233/ 11920 | consumed samples: 10478592 | elapsed time per iteration (ms): 5934.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820475E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:51.353853 | finish at 2025-09-10 12:01:56 + [2025-09-10 09:15:11] iteration 10234/ 11920 | consumed samples: 10479616 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799222E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:09.359543 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:15:16] iteration 10235/ 11920 | consumed samples: 10480640 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807219E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:04.915934 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:15:22] iteration 10236/ 11920 | consumed samples: 10481664 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809047E+00 | loss scale: 1.0 | grad norm: 0.155 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:49.319743 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:15:28] iteration 10237/ 11920 | consumed samples: 10482688 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810044E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:39.187289 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:15:33] iteration 10238/ 11920 | consumed samples: 10483712 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813504E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:26.731470 | finish at 2025-09-10 11:53:00 + [2025-09-10 09:15:39] iteration 10239/ 11920 | consumed samples: 10484736 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798183E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:23.806350 | finish at 2025-09-10 11:53:03 + [2025-09-10 09:15:44] iteration 10240/ 11920 | consumed samples: 10485760 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809351E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:18.880520 | finish at 2025-09-10 11:53:03 + [2025-09-10 09:15:50] iteration 10241/ 11920 | consumed samples: 10486784 | elapsed time per iteration (ms): 5983.2 | 
throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798100E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:47:25.718466 | finish at 2025-09-10 12:03:16 + [2025-09-10 09:15:56] iteration 10242/ 11920 | consumed samples: 10487808 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814117E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:20.252649 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:16:02] iteration 10243/ 11920 | consumed samples: 10488832 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804861E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:29.374408 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:16:07] iteration 10244/ 11920 | consumed samples: 10489856 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800193E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:03.824176 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:16:13] iteration 10245/ 11920 | consumed samples: 10490880 | elapsed time per iteration (ms): 5956.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810496E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:46:17.832860 | finish at 2025-09-10 12:02:31 + [2025-09-10 09:16:19] iteration 10246/ 11920 | consumed samples: 10491904 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814272E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:52.475591 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:16:25] iteration 10247/ 11920 | consumed samples: 10492928 | elapsed time per iteration (ms): 5971.6 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796219E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:46:30.478517 | finish at 2025-09-10 12:02:55 + [2025-09-10 09:16:31] iteration 10248/ 11920 | consumed samples: 10493952 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811417E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:39.940115 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:16:36] iteration 10249/ 11920 | consumed samples: 10494976 | elapsed time per iteration (ms): 5841.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819762E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:41.086814 | finish at 2025-09-10 11:59:17 + [2025-09-10 09:16:42] iteration 10250/ 11920 | consumed samples: 10496000 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.804018E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:25.199137 | finish at 2025-09-10 11:53:07 + [2025-09-10 09:16:48] iteration 10251/ 11920 | consumed samples: 10497024 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804005E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:27.343882 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:16:53] iteration 10252/ 11920 | consumed samples: 10498048 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798467E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:25.862801 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:16:59] iteration 10253/ 11920 | consumed samples: 10499072 | elapsed time per iteration (ms): 5843.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804608E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:20.955354 | finish at 2025-09-10 11:59:20 + [2025-09-10 09:17:05] iteration 10254/ 11920 | consumed samples: 10500096 | elapsed time per iteration (ms): 5853.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815374E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:42:32.287109 | finish at 2025-09-10 11:59:37 + [2025-09-10 09:17:11] iteration 
10255/ 11920 | consumed samples: 10501120 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804654E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:04.141138 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:17:16] iteration 10256/ 11920 | consumed samples: 10502144 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810837E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:04.981720 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:17:22] iteration 10257/ 11920 | consumed samples: 10503168 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807164E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:59.323989 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:17:27] iteration 10258/ 11920 | consumed samples: 10504192 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795888E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:44.213711 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:17:33] iteration 10259/ 11920 | consumed samples: 10505216 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809762E+00 | loss scale: 1.0 | grad norm: 0.193 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:39.576325 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:17:39] iteration 10260/ 11920 | consumed samples: 10506240 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815387E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:31.767201 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:17:44] iteration 10261/ 11920 | consumed samples: 10507264 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817225E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:25.874712 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:17:50] iteration 10262/ 11920 | consumed samples: 10508288 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801207E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:41.095021 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:17:56] iteration 10263/ 11920 | consumed samples: 10509312 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813727E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:30.943166 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:18:01] iteration 10264/ 11920 | consumed samples: 10510336 | elapsed time per iteration (ms): 5626.6 | 
throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814206E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:17.579762 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:18:07] iteration 10265/ 11920 | consumed samples: 10511360 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811678E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:24.845800 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:18:12] iteration 10266/ 11920 | consumed samples: 10512384 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799528E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:02.968804 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:18:18] iteration 10267/ 11920 | consumed samples: 10513408 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807262E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:56.767699 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:18:24] iteration 10268/ 11920 | consumed samples: 10514432 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812895E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:34:48.757865 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:18:29] iteration 10269/ 11920 | consumed samples: 10515456 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812857E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:43.850354 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:18:35] iteration 10270/ 11920 | consumed samples: 10516480 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810337E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:49.916003 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:18:41] iteration 10271/ 11920 | consumed samples: 10517504 | elapsed time per iteration (ms): 5843.3 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800946E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:35.600299 | finish at 2025-09-10 11:59:16 + [2025-09-10 09:18:46] iteration 10272/ 11920 | consumed samples: 10518528 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797730E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:26.405624 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:18:52] iteration 10273/ 11920 | consumed samples: 10519552 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.809641E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:37.658816 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:18:58] iteration 10274/ 11920 | consumed samples: 10520576 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809541E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:21.179970 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:19:03] iteration 10275/ 11920 | consumed samples: 10521600 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808963E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:11.436585 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:19:09] iteration 10276/ 11920 | consumed samples: 10522624 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819319E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:04.295334 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:19:15] iteration 10277/ 11920 | consumed samples: 10523648 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801518E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:13.745735 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:19:20] iteration 
10278/ 11920 | consumed samples: 10524672 | elapsed time per iteration (ms): 5848.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824592E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:40:03.420236 | finish at 2025-09-10 11:59:24 + [2025-09-10 09:19:26] iteration 10279/ 11920 | consumed samples: 10525696 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798080E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:44.428855 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:19:32] iteration 10280/ 11920 | consumed samples: 10526720 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817634E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:55.850430 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:19:37] iteration 10281/ 11920 | consumed samples: 10527744 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824525E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:31.661239 | finish at 2025-09-10 11:53:09 + [2025-09-10 09:19:43] iteration 10282/ 11920 | consumed samples: 10528768 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797995E+00 | loss scale: 1.0 | grad norm: 0.220 | num 
zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:35.220344 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:19:49] iteration 10283/ 11920 | consumed samples: 10529792 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821883E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:19.829360 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:19:55] iteration 10284/ 11920 | consumed samples: 10530816 | elapsed time per iteration (ms): 6269.5 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796984E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:50:56.898518 | finish at 2025-09-10 12:10:52 + [2025-09-10 09:20:00] iteration 10285/ 11920 | consumed samples: 10531840 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799729E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:18.933220 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:20:06] iteration 10286/ 11920 | consumed samples: 10532864 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813041E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:14.543865 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:20:12] iteration 10287/ 11920 | consumed samples: 10533888 | elapsed time per iteration (ms): 5620.4 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817935E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:58.150498 | finish at 2025-09-10 11:53:10 + [2025-09-10 09:20:18] iteration 10288/ 11920 | consumed samples: 10534912 | elapsed time per iteration (ms): 5949.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816522E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:49.239784 | finish at 2025-09-10 12:02:07 + [2025-09-10 09:20:23] iteration 10289/ 11920 | consumed samples: 10535936 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801361E+00 | loss scale: 1.0 | grad norm: 0.263 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:59.570568 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:20:29] iteration 10290/ 11920 | consumed samples: 10536960 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803946E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:47.988300 | finish at 2025-09-10 11:53:17 + [2025-09-10 09:20:35] iteration 10291/ 11920 | consumed samples: 10537984 | elapsed time per iteration (ms): 5962.3 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814857E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:41:52.605444 | finish at 2025-09-10 12:02:27 + [2025-09-10 09:20:40] iteration 10292/ 11920 | consumed samples: 10539008 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812634E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:40.146376 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:20:46] iteration 10293/ 11920 | consumed samples: 10540032 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817722E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:26.274010 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:20:52] iteration 10294/ 11920 | consumed samples: 10541056 | elapsed time per iteration (ms): 5857.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813274E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:38:43.556615 | finish at 2025-09-10 11:59:35 + [2025-09-10 09:20:58] iteration 10295/ 11920 | consumed samples: 10542080 | elapsed time per iteration (ms): 5947.1 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805211E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:03.996309 | finish at 2025-09-10 12:02:02 + [2025-09-10 09:21:04] iteration 10296/ 11920 | consumed samples: 10543104 | elapsed time per iteration (ms): 5892.9 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.794514E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:30.085072 | finish at 2025-09-10 12:00:34 + [2025-09-10 09:21:09] iteration 10297/ 11920 | consumed samples: 10544128 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814204E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:12.515126 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:21:15] iteration 10298/ 11920 | consumed samples: 10545152 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806252E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:28.600302 | finish at 2025-09-10 11:53:44 + [2025-09-10 09:21:21] iteration 10299/ 11920 | consumed samples: 10546176 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812869E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:04.629013 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:21:26] iteration 10300/ 11920 | consumed samples: 10547200 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803607E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:37.722530 | finish at 2025-09-10 11:53:04 + [2025-09-10 09:21:32] iteration 
10301/ 11920 | consumed samples: 10548224 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808333E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:34.341590 | finish at 2025-09-10 11:53:06 + [2025-09-10 09:21:38] iteration 10302/ 11920 | consumed samples: 10549248 | elapsed time per iteration (ms): 5988.6 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820518E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:41:29.500819 | finish at 2025-09-10 12:03:07 + [2025-09-10 09:21:44] iteration 10303/ 11920 | consumed samples: 10550272 | elapsed time per iteration (ms): 5934.5 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809551E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:56.110368 | finish at 2025-09-10 12:01:40 + [2025-09-10 09:21:49] iteration 10304/ 11920 | consumed samples: 10551296 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814724E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:25.314941 | finish at 2025-09-10 11:53:15 + [2025-09-10 09:21:55] iteration 10305/ 11920 | consumed samples: 10552320 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816779E+00 | loss scale: 1.0 | grad norm: 0.208 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:30.124892 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:22:01] iteration 10306/ 11920 | consumed samples: 10553344 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799295E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:28.486402 | finish at 2025-09-10 11:53:29 + [2025-09-10 09:22:06] iteration 10307/ 11920 | consumed samples: 10554368 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806203E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:19.764587 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:22:13] iteration 10308/ 11920 | consumed samples: 10555392 | elapsed time per iteration (ms): 6271.0 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799396E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:48:28.861076 | finish at 2025-09-10 12:10:41 + [2025-09-10 09:22:18] iteration 10309/ 11920 | consumed samples: 10556416 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817691E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:31:20.610640 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:22:24] iteration 10310/ 11920 | consumed samples: 10557440 | elapsed time per iteration (ms): 5617.9 | 
throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798125E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:44.783516 | finish at 2025-09-10 11:53:09 + [2025-09-10 09:22:29] iteration 10311/ 11920 | consumed samples: 10558464 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810143E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:42.332384 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:22:35] iteration 10312/ 11920 | consumed samples: 10559488 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809659E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:38.446936 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:22:41] iteration 10313/ 11920 | consumed samples: 10560512 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807825E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:33.205701 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:22:46] iteration 10314/ 11920 | consumed samples: 10561536 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807324E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:30:36.044721 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:22:52] iteration 10315/ 11920 | consumed samples: 10562560 | elapsed time per iteration (ms): 5880.8 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805108E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:18.724959 | finish at 2025-09-10 12:00:11 + [2025-09-10 09:22:58] iteration 10316/ 11920 | consumed samples: 10563584 | elapsed time per iteration (ms): 5856.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802955E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:34.126059 | finish at 2025-09-10 11:59:32 + [2025-09-10 09:23:04] iteration 10317/ 11920 | consumed samples: 10564608 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809105E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:18.537887 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:23:10] iteration 10318/ 11920 | consumed samples: 10565632 | elapsed time per iteration (ms): 5836.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796122E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:50.309312 | finish at 2025-09-10 11:59:00 + [2025-09-10 09:23:15] iteration 10319/ 11920 | consumed samples: 10566656 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.824026E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:02.594619 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:23:21] iteration 10320/ 11920 | consumed samples: 10567680 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814812E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:51.992569 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:23:27] iteration 10321/ 11920 | consumed samples: 10568704 | elapsed time per iteration (ms): 5915.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807594E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:38.492094 | finish at 2025-09-10 12:01:05 + [2025-09-10 09:23:32] iteration 10322/ 11920 | consumed samples: 10569728 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807405E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:41.737445 | finish at 2025-09-10 11:53:14 + [2025-09-10 09:23:38] iteration 10323/ 11920 | consumed samples: 10570752 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806636E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:30.430265 | finish at 2025-09-10 11:53:08 + [2025-09-10 09:23:44] iteration 
10324/ 11920 | consumed samples: 10571776 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805502E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:28.797598 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:23:49] iteration 10325/ 11920 | consumed samples: 10572800 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796151E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:28.445276 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:23:55] iteration 10326/ 11920 | consumed samples: 10573824 | elapsed time per iteration (ms): 6012.1 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805682E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:43.256207 | finish at 2025-09-10 12:03:38 + [2025-09-10 09:24:01] iteration 10327/ 11920 | consumed samples: 10574848 | elapsed time per iteration (ms): 5835.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813218E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:56.512849 | finish at 2025-09-10 11:58:58 + [2025-09-10 09:24:07] iteration 10328/ 11920 | consumed samples: 10575872 | elapsed time per iteration (ms): 5859.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805844E+00 | loss scale: 1.0 | grad norm: 0.154 | num 
zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:28.489759 | finish at 2025-09-10 11:59:35 + [2025-09-10 09:24:13] iteration 10329/ 11920 | consumed samples: 10576896 | elapsed time per iteration (ms): 5820.6 | throughput per GPU (TFLOP/s/GPU): 77.6 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795889E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:34:20.593618 | finish at 2025-09-10 11:58:33 + [2025-09-10 09:24:18] iteration 10330/ 11920 | consumed samples: 10577920 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814501E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:01.630533 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:24:24] iteration 10331/ 11920 | consumed samples: 10578944 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798874E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:03.891432 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:24:30] iteration 10332/ 11920 | consumed samples: 10579968 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817695E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:19.709472 | finish at 2025-09-10 11:53:49 + [2025-09-10 09:24:35] iteration 10333/ 11920 | consumed samples: 10580992 | elapsed time per iteration (ms): 5617.8 | 
throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810846E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:35.489460 | finish at 2025-09-10 11:53:11 + [2025-09-10 09:24:41] iteration 10334/ 11920 | consumed samples: 10582016 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809763E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:38.659040 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:24:46] iteration 10335/ 11920 | consumed samples: 10583040 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819743E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:26.373413 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:24:52] iteration 10336/ 11920 | consumed samples: 10584064 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802318E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:48.900501 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:24:58] iteration 10337/ 11920 | consumed samples: 10585088 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813998E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:28:21.486253 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:25:04] iteration 10338/ 11920 | consumed samples: 10586112 | elapsed time per iteration (ms): 5986.7 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820539E+00 | loss scale: 1.0 | grad norm: 0.252 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:37:50.979859 | finish at 2025-09-10 12:02:55 + [2025-09-10 09:25:09] iteration 10339/ 11920 | consumed samples: 10587136 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812245E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:13.107284 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:25:15] iteration 10340/ 11920 | consumed samples: 10588160 | elapsed time per iteration (ms): 5938.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820739E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:36:22.208328 | finish at 2025-09-10 12:01:37 + [2025-09-10 09:25:21] iteration 10341/ 11920 | consumed samples: 10589184 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814452E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:17.375485 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:25:27] iteration 10342/ 11920 | consumed samples: 10590208 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.810992E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:56.984390 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:25:32] iteration 10343/ 11920 | consumed samples: 10591232 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820266E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:52.742554 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:25:38] iteration 10344/ 11920 | consumed samples: 10592256 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797002E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:54.578936 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:25:43] iteration 10345/ 11920 | consumed samples: 10593280 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812900E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:32.235818 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:25:49] iteration 10346/ 11920 | consumed samples: 10594304 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838038E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:45.080553 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:25:55] iteration 
10347/ 11920 | consumed samples: 10595328 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802623E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:34.440546 | finish at 2025-09-10 11:53:29 + [2025-09-10 09:26:01] iteration 10348/ 11920 | consumed samples: 10596352 | elapsed time per iteration (ms): 5860.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799675E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:33:33.260848 | finish at 2025-09-10 11:59:34 + [2025-09-10 09:26:06] iteration 10349/ 11920 | consumed samples: 10597376 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808906E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:15.123578 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:26:12] iteration 10350/ 11920 | consumed samples: 10598400 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813839E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:27.748778 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:26:17] iteration 10351/ 11920 | consumed samples: 10599424 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797671E+00 | loss scale: 1.0 | grad norm: 0.147 | num 
zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:17.334413 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:26:23] iteration 10352/ 11920 | consumed samples: 10600448 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797519E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:08.511971 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:26:29] iteration 10353/ 11920 | consumed samples: 10601472 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806276E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:48.398117 | finish at 2025-09-10 11:53:17 + [2025-09-10 09:26:34] iteration 10354/ 11920 | consumed samples: 10602496 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802985E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:43.406795 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:26:40] iteration 10355/ 11920 | consumed samples: 10603520 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806166E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:43.449992 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:26:46] iteration 10356/ 11920 | consumed samples: 10604544 | elapsed time per iteration (ms): 5864.3 | 
throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812143E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:51.687286 | finish at 2025-09-10 11:59:37 + [2025-09-10 09:26:51] iteration 10357/ 11920 | consumed samples: 10605568 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805821E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:33.070083 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:26:57] iteration 10358/ 11920 | consumed samples: 10606592 | elapsed time per iteration (ms): 5989.6 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815857E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:55.747663 | finish at 2025-09-10 12:02:53 + [2025-09-10 09:27:03] iteration 10359/ 11920 | consumed samples: 10607616 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811936E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:19.505136 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:27:09] iteration 10360/ 11920 | consumed samples: 10608640 | elapsed time per iteration (ms): 5874.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814367E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:32:44.135628 | finish at 2025-09-10 11:59:53 + [2025-09-10 09:27:15] iteration 10361/ 11920 | consumed samples: 10609664 | elapsed time per iteration (ms): 5990.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813923E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:39.156749 | finish at 2025-09-10 12:02:54 + [2025-09-10 09:27:21] iteration 10362/ 11920 | consumed samples: 10610688 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795264E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:09.883484 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:27:26] iteration 10363/ 11920 | consumed samples: 10611712 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815403E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:26:06.807040 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:27:32] iteration 10364/ 11920 | consumed samples: 10612736 | elapsed time per iteration (ms): 5978.4 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821086E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:35:02.369613 | finish at 2025-09-10 12:02:35 + [2025-09-10 09:27:38] iteration 10365/ 11920 | consumed samples: 10613760 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.807149E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:56.391177 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:27:44] iteration 10366/ 11920 | consumed samples: 10614784 | elapsed time per iteration (ms): 6338.4 | throughput per GPU (TFLOP/s/GPU): 71.2 | MFU 7.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802742E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:44:09.862680 | finish at 2025-09-10 12:11:54 + [2025-09-10 09:27:50] iteration 10367/ 11920 | consumed samples: 10615808 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805018E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:40.255152 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:27:55] iteration 10368/ 11920 | consumed samples: 10616832 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804307E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:24.385971 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:28:01] iteration 10369/ 11920 | consumed samples: 10617856 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819196E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:31.912743 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:28:07] iteration 
10370/ 11920 | consumed samples: 10618880 | elapsed time per iteration (ms): 5841.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816314E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:54.164684 | finish at 2025-09-10 11:59:01 + [2025-09-10 09:28:12] iteration 10371/ 11920 | consumed samples: 10619904 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824582E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:04.746086 | finish at 2025-09-10 11:53:17 + [2025-09-10 09:28:18] iteration 10372/ 11920 | consumed samples: 10620928 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819156E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:07.147905 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:28:24] iteration 10373/ 11920 | consumed samples: 10621952 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811472E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:57.988601 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:28:29] iteration 10374/ 11920 | consumed samples: 10622976 | elapsed time per iteration (ms): 5639.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793827E+00 | loss scale: 1.0 | grad norm: 0.212 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:18.376765 | finish at 2025-09-10 11:53:48 + [2025-09-10 09:28:35] iteration 10375/ 11920 | consumed samples: 10624000 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802917E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:48.065289 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:28:41] iteration 10376/ 11920 | consumed samples: 10625024 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821676E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:47.064775 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:28:46] iteration 10377/ 11920 | consumed samples: 10626048 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806003E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:37.800842 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:28:52] iteration 10378/ 11920 | consumed samples: 10627072 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811181E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:32.606267 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:28:57] iteration 10379/ 11920 | consumed samples: 10628096 | elapsed time per iteration (ms): 5624.2 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810398E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:26.845703 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:29:03] iteration 10380/ 11920 | consumed samples: 10629120 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804270E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:41.068616 | finish at 2025-09-10 11:53:44 + [2025-09-10 09:29:09] iteration 10381/ 11920 | consumed samples: 10630144 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810785E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:07.250526 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:29:14] iteration 10382/ 11920 | consumed samples: 10631168 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813832E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:20.522434 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:29:20] iteration 10383/ 11920 | consumed samples: 10632192 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818393E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:23:59.038439 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:29:26] iteration 10384/ 11920 | consumed samples: 10633216 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821800E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:46.075928 | finish at 2025-09-10 11:53:12 + [2025-09-10 09:29:31] iteration 10385/ 11920 | consumed samples: 10634240 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814873E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:51.868820 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:29:37] iteration 10386/ 11920 | consumed samples: 10635264 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808901E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:45.058280 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:29:42] iteration 10387/ 11920 | consumed samples: 10636288 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809316E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:39.804472 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:29:48] iteration 10388/ 11920 | consumed samples: 10637312 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.797426E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:27.487933 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:29:54] iteration 10389/ 11920 | consumed samples: 10638336 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798965E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:42.589033 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:29:59] iteration 10390/ 11920 | consumed samples: 10639360 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813313E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:25.675106 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:30:05] iteration 10391/ 11920 | consumed samples: 10640384 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798238E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:25.602821 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:30:11] iteration 10392/ 11920 | consumed samples: 10641408 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793035E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:24.441656 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:30:16] iteration 
10393/ 11920 | consumed samples: 10642432 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805533E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:13.589054 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:30:22] iteration 10394/ 11920 | consumed samples: 10643456 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804287E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:11.436568 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:30:28] iteration 10395/ 11920 | consumed samples: 10644480 | elapsed time per iteration (ms): 6255.9 | throughput per GPU (TFLOP/s/GPU): 72.2 | MFU 7.30% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812943E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:39:00.218270 | finish at 2025-09-10 12:09:28 + [2025-09-10 09:30:34] iteration 10396/ 11920 | consumed samples: 10645504 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807703E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:45.272970 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:30:39] iteration 10397/ 11920 | consumed samples: 10646528 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810276E+00 | loss scale: 1.0 | grad norm: 0.183 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:40.964271 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:30:45] iteration 10398/ 11920 | consumed samples: 10647552 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794121E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:47.259904 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:30:51] iteration 10399/ 11920 | consumed samples: 10648576 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801683E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:25.432428 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:30:56] iteration 10400/ 11920 | consumed samples: 10649600 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804183E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:33.460884 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:31:02] iteration 10401/ 11920 | consumed samples: 10650624 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814868E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:31.045585 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:31:07] iteration 10402/ 11920 | consumed samples: 10651648 | elapsed time per iteration (ms): 5625.8 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808080E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:19.989574 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:31:13] iteration 10403/ 11920 | consumed samples: 10652672 | elapsed time per iteration (ms): 5956.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813339E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:30:35.944393 | finish at 2025-09-10 12:01:49 + [2025-09-10 09:31:19] iteration 10404/ 11920 | consumed samples: 10653696 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819425E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:37.024436 | finish at 2025-09-10 11:53:56 + [2025-09-10 09:31:25] iteration 10405/ 11920 | consumed samples: 10654720 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806410E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:15.402095 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:31:30] iteration 10406/ 11920 | consumed samples: 10655744 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817037E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:21:50.089759 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:31:36] iteration 10407/ 11920 | consumed samples: 10656768 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800759E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:43.737273 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:31:42] iteration 10408/ 11920 | consumed samples: 10657792 | elapsed time per iteration (ms): 5873.0 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824765E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:27:59.932240 | finish at 2025-09-10 11:59:42 + [2025-09-10 09:31:47] iteration 10409/ 11920 | consumed samples: 10658816 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804789E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:38.031626 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:31:53] iteration 10410/ 11920 | consumed samples: 10659840 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802399E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:37.336800 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:31:59] iteration 10411/ 11920 | consumed samples: 10660864 | elapsed time per iteration (ms): 5649.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.812650E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:05.454033 | finish at 2025-09-10 11:54:04 + [2025-09-10 09:32:04] iteration 10412/ 11920 | consumed samples: 10661888 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801885E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:36.179955 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:32:10] iteration 10413/ 11920 | consumed samples: 10662912 | elapsed time per iteration (ms): 5950.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821944E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:29:27.741616 | finish at 2025-09-10 12:01:38 + [2025-09-10 09:32:16] iteration 10414/ 11920 | consumed samples: 10663936 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803562E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:09.505335 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:32:22] iteration 10415/ 11920 | consumed samples: 10664960 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821711E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:06.566185 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:32:27] iteration 
10416/ 11920 | consumed samples: 10665984 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817192E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:46.180267 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:32:33] iteration 10417/ 11920 | consumed samples: 10667008 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798336E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:46.992057 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:32:38] iteration 10418/ 11920 | consumed samples: 10668032 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818119E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:51.579027 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:32:44] iteration 10419/ 11920 | consumed samples: 10669056 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800662E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:37.647856 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:32:50] iteration 10420/ 11920 | consumed samples: 10670080 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803496E+00 | loss scale: 1.0 | grad norm: 0.185 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:32.601571 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:32:55] iteration 10421/ 11920 | consumed samples: 10671104 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807631E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:38.175419 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:33:01] iteration 10422/ 11920 | consumed samples: 10672128 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804376E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:21.361317 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:33:07] iteration 10423/ 11920 | consumed samples: 10673152 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807159E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:32.291759 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:33:12] iteration 10424/ 11920 | consumed samples: 10674176 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789083E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:23.576231 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:33:18] iteration 10425/ 11920 | consumed samples: 10675200 | elapsed time per iteration (ms): 5632.0 | 
throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807902E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:19.890925 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:33:24] iteration 10426/ 11920 | consumed samples: 10676224 | elapsed time per iteration (ms): 5953.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820713E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:28:13.795638 | finish at 2025-09-10 12:01:38 + [2025-09-10 09:33:29] iteration 10427/ 11920 | consumed samples: 10677248 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788409E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:00.560471 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:33:35] iteration 10428/ 11920 | consumed samples: 10678272 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819722E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:46.396192 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:33:41] iteration 10429/ 11920 | consumed samples: 10679296 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804487E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:19:43.539511 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:33:46] iteration 10430/ 11920 | consumed samples: 10680320 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821267E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:36.981747 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:33:52] iteration 10431/ 11920 | consumed samples: 10681344 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799162E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:30.445118 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:33:58] iteration 10432/ 11920 | consumed samples: 10682368 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806812E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:24.631668 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:34:03] iteration 10433/ 11920 | consumed samples: 10683392 | elapsed time per iteration (ms): 5836.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810552E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:38.868659 | finish at 2025-09-10 11:58:42 + [2025-09-10 09:34:09] iteration 10434/ 11920 | consumed samples: 10684416 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.808432E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:03.620399 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:34:15] iteration 10435/ 11920 | consumed samples: 10685440 | elapsed time per iteration (ms): 5835.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787991E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:24:25.861956 | finish at 2025-09-10 11:58:41 + [2025-09-10 09:34:20] iteration 10436/ 11920 | consumed samples: 10686464 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812539E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:11.286509 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:34:26] iteration 10437/ 11920 | consumed samples: 10687488 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809811E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:58.308491 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:34:32] iteration 10438/ 11920 | consumed samples: 10688512 | elapsed time per iteration (ms): 5614.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803558E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:40.925803 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:34:37] iteration 
10439/ 11920 | consumed samples: 10689536 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806534E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:35.457326 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:34:43] iteration 10440/ 11920 | consumed samples: 10690560 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809592E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:37.172518 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:34:49] iteration 10441/ 11920 | consumed samples: 10691584 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813245E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:29.282985 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:34:54] iteration 10442/ 11920 | consumed samples: 10692608 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806976E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:59.392886 | finish at 2025-09-10 11:53:54 + [2025-09-10 09:35:00] iteration 10443/ 11920 | consumed samples: 10693632 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799932E+00 | loss scale: 1.0 | grad norm: 0.191 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:19.333015 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:35:05] iteration 10444/ 11920 | consumed samples: 10694656 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810029E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:28.775537 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:35:11] iteration 10445/ 11920 | consumed samples: 10695680 | elapsed time per iteration (ms): 5830.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806075E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:20.213808 | finish at 2025-09-10 11:58:31 + [2025-09-10 09:35:17] iteration 10446/ 11920 | consumed samples: 10696704 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806603E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:05.310147 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:35:22] iteration 10447/ 11920 | consumed samples: 10697728 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800586E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:21.125147 | finish at 2025-09-10 11:53:44 + [2025-09-10 09:35:28] iteration 10448/ 11920 | consumed samples: 10698752 | elapsed time per iteration (ms): 5617.1 | 
throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808701E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:48.342072 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:35:34] iteration 10449/ 11920 | consumed samples: 10699776 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815248E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:58.689831 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:35:40] iteration 10450/ 11920 | consumed samples: 10700800 | elapsed time per iteration (ms): 6217.7 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.34% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809167E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:32:20.003235 | finish at 2025-09-10 12:08:00 + [2025-09-10 09:35:46] iteration 10451/ 11920 | consumed samples: 10701824 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784585E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:42.026657 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:35:52] iteration 10452/ 11920 | consumed samples: 10702848 | elapsed time per iteration (ms): 6269.8 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811133E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:33:24.053023 | finish at 2025-09-10 12:09:16 + [2025-09-10 09:35:57] iteration 10453/ 11920 | consumed samples: 10703872 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800913E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:46.715319 | finish at 2025-09-10 11:53:44 + [2025-09-10 09:36:03] iteration 10454/ 11920 | consumed samples: 10704896 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798776E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:32.033535 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:36:09] iteration 10455/ 11920 | consumed samples: 10705920 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792319E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:18.150678 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:36:14] iteration 10456/ 11920 | consumed samples: 10706944 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804593E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:17:10.836245 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:36:20] iteration 10457/ 11920 | consumed samples: 10707968 | elapsed time per iteration (ms): 5964.4 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.804327E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:25.899488 | finish at 2025-09-10 12:01:46 + [2025-09-10 09:36:26] iteration 10458/ 11920 | consumed samples: 10708992 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817523E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:54.249432 | finish at 2025-09-10 11:53:20 + [2025-09-10 09:36:32] iteration 10459/ 11920 | consumed samples: 10710016 | elapsed time per iteration (ms): 5613.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811378E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:41.724600 | finish at 2025-09-10 11:53:13 + [2025-09-10 09:36:38] iteration 10460/ 11920 | consumed samples: 10711040 | elapsed time per iteration (ms): 5962.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802018E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:25:05.840154 | finish at 2025-09-10 12:01:43 + [2025-09-10 09:36:43] iteration 10461/ 11920 | consumed samples: 10712064 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817860E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:42.213774 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:36:49] iteration 
10462/ 11920 | consumed samples: 10713088 | elapsed time per iteration (ms): 5918.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806069E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:48.499941 | finish at 2025-09-10 12:00:38 + [2025-09-10 09:36:55] iteration 10463/ 11920 | consumed samples: 10714112 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810654E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:38.474872 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:37:00] iteration 10464/ 11920 | consumed samples: 10715136 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807864E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:34.629772 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:37:06] iteration 10465/ 11920 | consumed samples: 10716160 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808458E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:32.281870 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:37:12] iteration 10466/ 11920 | consumed samples: 10717184 | elapsed time per iteration (ms): 5931.8 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809427E+00 | loss scale: 1.0 | grad norm: 0.215 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:44.772213 | finish at 2025-09-10 12:00:57 + [2025-09-10 09:37:18] iteration 10467/ 11920 | consumed samples: 10718208 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818244E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:10.497388 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:37:23] iteration 10468/ 11920 | consumed samples: 10719232 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806100E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:01.045750 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:37:29] iteration 10469/ 11920 | consumed samples: 10720256 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822841E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:48.881984 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:37:34] iteration 10470/ 11920 | consumed samples: 10721280 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814179E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:43.151164 | finish at 2025-09-10 11:53:18 + [2025-09-10 09:37:40] iteration 10471/ 11920 | consumed samples: 10722304 | elapsed time per iteration (ms): 5621.7 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809433E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:45.792586 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:37:46] iteration 10472/ 11920 | consumed samples: 10723328 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813026E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:37.918295 | finish at 2025-09-10 11:53:24 + [2025-09-10 09:37:51] iteration 10473/ 11920 | consumed samples: 10724352 | elapsed time per iteration (ms): 5615.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795886E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:24.889489 | finish at 2025-09-10 11:53:16 + [2025-09-10 09:37:57] iteration 10474/ 11920 | consumed samples: 10725376 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813655E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:48.390982 | finish at 2025-09-10 11:53:45 + [2025-09-10 09:38:03] iteration 10475/ 11920 | consumed samples: 10726400 | elapsed time per iteration (ms): 6346.5 | throughput per GPU (TFLOP/s/GPU): 71.1 | MFU 7.19% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800596E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:32:50.725113 | finish at 2025-09-10 12:10:54 + [2025-09-10 09:38:09] iteration 10476/ 11920 | consumed samples: 10727424 | elapsed time per iteration (ms): 5850.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801303E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:48.186534 | finish at 2025-09-10 11:58:57 + [2025-09-10 09:38:15] iteration 10477/ 11920 | consumed samples: 10728448 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798466E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:22.484196 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:38:21] iteration 10478/ 11920 | consumed samples: 10729472 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803621E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:23:49.057909 | finish at 2025-09-10 12:02:10 + [2025-09-10 09:38:26] iteration 10479/ 11920 | consumed samples: 10730496 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786723E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:00.896568 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:38:32] iteration 10480/ 11920 | consumed samples: 10731520 | elapsed time per iteration (ms): 5921.4 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.811400E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:06.803741 | finish at 2025-09-10 12:00:39 + [2025-09-10 09:38:38] iteration 10481/ 11920 | consumed samples: 10732544 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807511E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:41.479489 | finish at 2025-09-10 11:53:19 + [2025-09-10 09:38:44] iteration 10482/ 11920 | consumed samples: 10733568 | elapsed time per iteration (ms): 5961.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798332E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:53.080198 | finish at 2025-09-10 12:01:37 + [2025-09-10 09:38:50] iteration 10483/ 11920 | consumed samples: 10734592 | elapsed time per iteration (ms): 5958.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809239E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:22:42.638797 | finish at 2025-09-10 12:01:32 + [2025-09-10 09:38:55] iteration 10484/ 11920 | consumed samples: 10735616 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821331E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:40.069480 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:39:01] iteration 
10485/ 11920 | consumed samples: 10736640 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800391E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:32.218841 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:39:07] iteration 10486/ 11920 | consumed samples: 10737664 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799220E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:27.608680 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:39:12] iteration 10487/ 11920 | consumed samples: 10738688 | elapsed time per iteration (ms): 5836.6 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809356E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:23.780929 | finish at 2025-09-10 11:58:36 + [2025-09-10 09:39:18] iteration 10488/ 11920 | consumed samples: 10739712 | elapsed time per iteration (ms): 5648.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820634E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:48.044573 | finish at 2025-09-10 11:54:06 + [2025-09-10 09:39:24] iteration 10489/ 11920 | consumed samples: 10740736 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812610E+00 | loss scale: 1.0 | grad norm: 0.249 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:01.039705 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:39:29] iteration 10490/ 11920 | consumed samples: 10741760 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815106E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:53.487067 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:39:35] iteration 10491/ 11920 | consumed samples: 10742784 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808964E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:52.529341 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:39:41] iteration 10492/ 11920 | consumed samples: 10743808 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821693E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:46.501064 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:39:46] iteration 10493/ 11920 | consumed samples: 10744832 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814516E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:38.450390 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:39:52] iteration 10494/ 11920 | consumed samples: 10745856 | elapsed time per iteration (ms): 5620.5 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809709E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:34.856924 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:39:57] iteration 10495/ 11920 | consumed samples: 10746880 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799826E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:37.517388 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:40:03] iteration 10496/ 11920 | consumed samples: 10747904 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824912E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:25.963249 | finish at 2025-09-10 11:53:29 + [2025-09-10 09:40:09] iteration 10497/ 11920 | consumed samples: 10748928 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804016E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:32.299321 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:40:14] iteration 10498/ 11920 | consumed samples: 10749952 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818452E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:13:06.989685 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:40:20] iteration 10499/ 11920 | consumed samples: 10750976 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812696E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:17.638736 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:40:26] iteration 10500/ 11920 | consumed samples: 10752000 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807877E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:08.185911 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:40:31] iteration 10501/ 11920 | consumed samples: 10753024 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813352E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:02.179146 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:40:37] iteration 10502/ 11920 | consumed samples: 10754048 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807337E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:50.084142 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:40:42] iteration 10503/ 11920 | consumed samples: 10755072 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.805968E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:45.493562 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:40:48] iteration 10504/ 11920 | consumed samples: 10756096 | elapsed time per iteration (ms): 5855.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804588E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:10.758013 | finish at 2025-09-10 11:58:59 + [2025-09-10 09:40:54] iteration 10505/ 11920 | consumed samples: 10757120 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818161E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:44.995118 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:41:00] iteration 10506/ 11920 | consumed samples: 10758144 | elapsed time per iteration (ms): 5862.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814055E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:18:08.939791 | finish at 2025-09-10 11:59:09 + [2025-09-10 09:41:05] iteration 10507/ 11920 | consumed samples: 10759168 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791416E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:28.078510 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:41:11] iteration 
10508/ 11920 | consumed samples: 10760192 | elapsed time per iteration (ms): 5627.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799305E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:26.651531 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:41:17] iteration 10509/ 11920 | consumed samples: 10761216 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802692E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:28.824570 | finish at 2025-09-10 11:53:46 + [2025-09-10 09:41:22] iteration 10510/ 11920 | consumed samples: 10762240 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807953E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:04.172945 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:41:28] iteration 10511/ 11920 | consumed samples: 10763264 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798235E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:58.627205 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:41:34] iteration 10512/ 11920 | consumed samples: 10764288 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808296E+00 | loss scale: 1.0 | grad norm: 0.137 | num 
zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:53.101501 | finish at 2025-09-10 11:53:27 + [2025-09-10 09:41:39] iteration 10513/ 11920 | consumed samples: 10765312 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806363E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:45.983595 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:41:45] iteration 10514/ 11920 | consumed samples: 10766336 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812958E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:36.383192 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:41:51] iteration 10515/ 11920 | consumed samples: 10767360 | elapsed time per iteration (ms): 5973.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820418E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:52.619458 | finish at 2025-09-10 12:01:43 + [2025-09-10 09:41:56] iteration 10516/ 11920 | consumed samples: 10768384 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787421E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:36.171418 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:42:02] iteration 10517/ 11920 | consumed samples: 10769408 | elapsed time per iteration (ms): 5842.1 | 
throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813284E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:36.412647 | finish at 2025-09-10 11:58:39 + [2025-09-10 09:42:08] iteration 10518/ 11920 | consumed samples: 10770432 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805898E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:24.589716 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:42:13] iteration 10519/ 11920 | consumed samples: 10771456 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817926E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:19.489317 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:42:19] iteration 10520/ 11920 | consumed samples: 10772480 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816265E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:20.971766 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:42:25] iteration 10521/ 11920 | consumed samples: 10773504 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798921E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:11:01.178068 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:42:30] iteration 10522/ 11920 | consumed samples: 10774528 | elapsed time per iteration (ms): 5615.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810373E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:50.961927 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:42:36] iteration 10523/ 11920 | consumed samples: 10775552 | elapsed time per iteration (ms): 5868.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809089E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:16:38.603915 | finish at 2025-09-10 11:59:15 + [2025-09-10 09:42:42] iteration 10524/ 11920 | consumed samples: 10776576 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791516E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:47.241582 | finish at 2025-09-10 11:53:29 + [2025-09-10 09:42:48] iteration 10525/ 11920 | consumed samples: 10777600 | elapsed time per iteration (ms): 5837.9 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803141E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:43.800881 | finish at 2025-09-10 11:58:31 + [2025-09-10 09:42:53] iteration 10526/ 11920 | consumed samples: 10778624 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.807005E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:41.932658 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:42:59] iteration 10527/ 11920 | consumed samples: 10779648 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812724E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:30.842514 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:43:05] iteration 10528/ 11920 | consumed samples: 10780672 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803761E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:39.749268 | finish at 2025-09-10 11:53:44 + [2025-09-10 09:43:10] iteration 10529/ 11920 | consumed samples: 10781696 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802057E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:36.397622 | finish at 2025-09-10 11:53:47 + [2025-09-10 09:43:16] iteration 10530/ 11920 | consumed samples: 10782720 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811442E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:18.524315 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:43:21] iteration 
10531/ 11920 | consumed samples: 10783744 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796700E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:14.997067 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:43:27] iteration 10532/ 11920 | consumed samples: 10784768 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809659E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:00.475124 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:43:33] iteration 10533/ 11920 | consumed samples: 10785792 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806406E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:49.657785 | finish at 2025-09-10 11:53:22 + [2025-09-10 09:43:38] iteration 10534/ 11920 | consumed samples: 10786816 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797295E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:47.148467 | finish at 2025-09-10 11:53:25 + [2025-09-10 09:43:44] iteration 10535/ 11920 | consumed samples: 10787840 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814420E+00 | loss scale: 1.0 | grad norm: 0.200 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:48.240225 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:43:50] iteration 10536/ 11920 | consumed samples: 10788864 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810586E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:45.636515 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:43:55] iteration 10537/ 11920 | consumed samples: 10789888 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810829E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:45.412409 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:44:01] iteration 10538/ 11920 | consumed samples: 10790912 | elapsed time per iteration (ms): 5865.3 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817440E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:05.831667 | finish at 2025-09-10 11:59:07 + [2025-09-10 09:44:07] iteration 10539/ 11920 | consumed samples: 10791936 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795123E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:25.855767 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:44:12] iteration 10540/ 11920 | consumed samples: 10792960 | elapsed time per iteration (ms): 5634.1 | 
throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804660E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:35.071764 | finish at 2025-09-10 11:53:47 + [2025-09-10 09:44:18] iteration 10541/ 11920 | consumed samples: 10793984 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821101E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:24.825210 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:44:24] iteration 10542/ 11920 | consumed samples: 10795008 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811320E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:11.024292 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:44:29] iteration 10543/ 11920 | consumed samples: 10796032 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801288E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:09.874874 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:44:35] iteration 10544/ 11920 | consumed samples: 10797056 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788111E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:08:54.757538 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:44:41] iteration 10545/ 11920 | consumed samples: 10798080 | elapsed time per iteration (ms): 5914.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807566E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:15:32.213920 | finish at 2025-09-10 12:00:13 + [2025-09-10 09:44:46] iteration 10546/ 11920 | consumed samples: 10799104 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801821E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:36.857277 | finish at 2025-09-10 11:53:23 + [2025-09-10 09:44:52] iteration 10547/ 11920 | consumed samples: 10800128 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787905E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:34.451896 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:44:58] iteration 10548/ 11920 | consumed samples: 10801152 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794494E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:53.386438 | finish at 2025-09-10 11:53:51 + [2025-09-10 09:45:03] iteration 10549/ 11920 | consumed samples: 10802176 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.798831E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:28.744875 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:45:09] iteration 10550/ 11920 | consumed samples: 10803200 | elapsed time per iteration (ms): 6182.0 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804396E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:21:09.296777 | finish at 2025-09-10 12:06:19 + [2025-09-10 09:45:15] iteration 10551/ 11920 | consumed samples: 10804224 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800142E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:17.720411 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:45:21] iteration 10552/ 11920 | consumed samples: 10805248 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788123E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:22.741001 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:45:26] iteration 10553/ 11920 | consumed samples: 10806272 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794304E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:01.678768 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:45:32] iteration 
10554/ 11920 | consumed samples: 10807296 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811199E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:54.550849 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:45:37] iteration 10555/ 11920 | consumed samples: 10808320 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811406E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:48.383889 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:45:43] iteration 10556/ 11920 | consumed samples: 10809344 | elapsed time per iteration (ms): 5614.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799165E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:37.803429 | finish at 2025-09-10 11:53:21 + [2025-09-10 09:45:49] iteration 10557/ 11920 | consumed samples: 10810368 | elapsed time per iteration (ms): 5844.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814623E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:46.663222 | finish at 2025-09-10 11:58:36 + [2025-09-10 09:45:55] iteration 10558/ 11920 | consumed samples: 10811392 | elapsed time per iteration (ms): 6139.3 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815234E+00 | loss scale: 1.0 | grad norm: 0.206 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:19:21.680636 | finish at 2025-09-10 12:05:17 + [2025-09-10 09:46:01] iteration 10559/ 11920 | consumed samples: 10812416 | elapsed time per iteration (ms): 6196.1 | throughput per GPU (TFLOP/s/GPU): 72.9 | MFU 7.37% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816712E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:20:32.934928 | finish at 2025-09-10 12:06:34 + [2025-09-10 09:46:07] iteration 10560/ 11920 | consumed samples: 10813440 | elapsed time per iteration (ms): 5617.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802768E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:19.164886 | finish at 2025-09-10 11:53:26 + [2025-09-10 09:46:13] iteration 10561/ 11920 | consumed samples: 10814464 | elapsed time per iteration (ms): 5958.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808743E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:58.127569 | finish at 2025-09-10 12:01:11 + [2025-09-10 09:46:18] iteration 10562/ 11920 | consumed samples: 10815488 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792779E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 13.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:21.536709 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:46:24] iteration 10563/ 11920 | consumed samples: 10816512 | elapsed time per iteration (ms): 5633.2 | 
throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813407E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:24.298894 | finish at 2025-09-10 11:53:48 + [2025-09-10 09:46:30] iteration 10564/ 11920 | consumed samples: 10817536 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805676E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:07.592139 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:46:35] iteration 10565/ 11920 | consumed samples: 10818560 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803942E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:04.270146 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:46:41] iteration 10566/ 11920 | consumed samples: 10819584 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791997E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:57.609713 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:46:47] iteration 10567/ 11920 | consumed samples: 10820608 | elapsed time per iteration (ms): 5871.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814697E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:12:24.455972 | finish at 2025-09-10 11:59:11 + [2025-09-10 09:46:52] iteration 10568/ 11920 | consumed samples: 10821632 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806181E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:47.721210 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:46:58] iteration 10569/ 11920 | consumed samples: 10822656 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803831E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:57.300382 | finish at 2025-09-10 11:53:55 + [2025-09-10 09:47:04] iteration 10570/ 11920 | consumed samples: 10823680 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812214E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:31.859365 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:47:09] iteration 10571/ 11920 | consumed samples: 10824704 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793068E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:21.807287 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:47:15] iteration 10572/ 11920 | consumed samples: 10825728 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.803257E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:22.254138 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:47:21] iteration 10573/ 11920 | consumed samples: 10826752 | elapsed time per iteration (ms): 5640.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810153E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:06:38.001841 | finish at 2025-09-10 11:53:59 + [2025-09-10 09:47:27] iteration 10574/ 11920 | consumed samples: 10827776 | elapsed time per iteration (ms): 5932.3 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803715E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:13:04.866616 | finish at 2025-09-10 12:00:31 + [2025-09-10 09:47:32] iteration 10575/ 11920 | consumed samples: 10828800 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799812E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:56.595490 | finish at 2025-09-10 11:53:29 + [2025-09-10 09:47:38] iteration 10576/ 11920 | consumed samples: 10829824 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794301E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:58.686859 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:47:43] iteration 
10577/ 11920 | consumed samples: 10830848 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798174E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:53.602367 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:47:49] iteration 10578/ 11920 | consumed samples: 10831872 | elapsed time per iteration (ms): 5923.5 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812474E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:29.283649 | finish at 2025-09-10 12:00:19 + [2025-09-10 09:47:55] iteration 10579/ 11920 | consumed samples: 10832896 | elapsed time per iteration (ms): 5867.3 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808297E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:08.020890 | finish at 2025-09-10 11:59:03 + [2025-09-10 09:48:01] iteration 10580/ 11920 | consumed samples: 10833920 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788906E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:32.124734 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:48:06] iteration 10581/ 11920 | consumed samples: 10834944 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814840E+00 | loss scale: 1.0 | grad norm: 0.189 | num 
zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:26.451709 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:48:12] iteration 10582/ 11920 | consumed samples: 10835968 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793771E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:28.207090 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:48:18] iteration 10583/ 11920 | consumed samples: 10836992 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802498E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:26.076211 | finish at 2025-09-10 11:53:44 + [2025-09-10 09:48:23] iteration 10584/ 11920 | consumed samples: 10838016 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812260E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:10.268278 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:48:29] iteration 10585/ 11920 | consumed samples: 10839040 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786161E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:03.332605 | finish at 2025-09-10 11:53:32 + [2025-09-10 09:48:35] iteration 10586/ 11920 | consumed samples: 10840064 | elapsed time per iteration (ms): 5966.2 | 
throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800209E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:12:38.856246 | finish at 2025-09-10 12:01:14 + [2025-09-10 09:48:41] iteration 10587/ 11920 | consumed samples: 10841088 | elapsed time per iteration (ms): 5842.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798874E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:47.576380 | finish at 2025-09-10 11:58:28 + [2025-09-10 09:48:47] iteration 10588/ 11920 | consumed samples: 10842112 | elapsed time per iteration (ms): 5924.8 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808680E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:11:31.846513 | finish at 2025-09-10 12:00:19 + [2025-09-10 09:48:52] iteration 10589/ 11920 | consumed samples: 10843136 | elapsed time per iteration (ms): 5616.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803590E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:35.508373 | finish at 2025-09-10 11:53:28 + [2025-09-10 09:48:58] iteration 10590/ 11920 | consumed samples: 10844160 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812244E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:04:39.939439 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:49:04] iteration 10591/ 11920 | consumed samples: 10845184 | elapsed time per iteration (ms): 5851.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808264E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:36.826373 | finish at 2025-09-10 11:58:41 + [2025-09-10 09:49:09] iteration 10592/ 11920 | consumed samples: 10846208 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808384E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:23.922165 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:49:15] iteration 10593/ 11920 | consumed samples: 10847232 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803324E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:21.699362 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:49:21] iteration 10594/ 11920 | consumed samples: 10848256 | elapsed time per iteration (ms): 5839.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795496E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:09:02.611038 | finish at 2025-09-10 11:58:23 + [2025-09-10 09:49:27] iteration 10595/ 11920 | consumed samples: 10849280 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.814498E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:10.977165 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:49:33] iteration 10596/ 11920 | consumed samples: 10850304 | elapsed time per iteration (ms): 6106.3 | throughput per GPU (TFLOP/s/GPU): 73.9 | MFU 7.48% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813400E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:14:44.686743 | finish at 2025-09-10 12:04:17 + [2025-09-10 09:49:38] iteration 10597/ 11920 | consumed samples: 10851328 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807762E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:03.207682 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:49:44] iteration 10598/ 11920 | consumed samples: 10852352 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797917E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:47.225498 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:49:49] iteration 10599/ 11920 | consumed samples: 10853376 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811686E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:40.613024 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:49:55] iteration 
10600/ 11920 | consumed samples: 10854400 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805272E+00 | loss scale: 1.0 | grad norm: 0.241 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:39.518023 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:50:01] iteration 10601/ 11920 | consumed samples: 10855424 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814649E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:30.702748 | finish at 2025-09-10 11:53:31 + [2025-09-10 09:50:06] iteration 10602/ 11920 | consumed samples: 10856448 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810145E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:31.889096 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:50:12] iteration 10603/ 11920 | consumed samples: 10857472 | elapsed time per iteration (ms): 5828.4 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798608E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:56.052589 | finish at 2025-09-10 11:58:08 + [2025-09-10 09:50:18] iteration 10604/ 11920 | consumed samples: 10858496 | elapsed time per iteration (ms): 5637.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797066E+00 | loss scale: 1.0 | grad norm: 0.178 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:38.903626 | finish at 2025-09-10 11:53:57 + [2025-09-10 09:50:23] iteration 10605/ 11920 | consumed samples: 10859520 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815132E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:21.935220 | finish at 2025-09-10 11:53:45 + [2025-09-10 09:50:29] iteration 10606/ 11920 | consumed samples: 10860544 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803754E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:12.652564 | finish at 2025-09-10 11:53:42 + [2025-09-10 09:50:35] iteration 10607/ 11920 | consumed samples: 10861568 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799735E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:03.291260 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:50:40] iteration 10608/ 11920 | consumed samples: 10862592 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798890E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:55.548157 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:50:46] iteration 10609/ 11920 | consumed samples: 10863616 | elapsed time per iteration (ms): 5955.2 | 
throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820613E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:10:07.324343 | finish at 2025-09-10 12:00:54 + [2025-09-10 09:50:52] iteration 10610/ 11920 | consumed samples: 10864640 | elapsed time per iteration (ms): 5859.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816218E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:07:55.644138 | finish at 2025-09-10 11:58:48 + [2025-09-10 09:50:58] iteration 10611/ 11920 | consumed samples: 10865664 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813715E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:42.395646 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:51:03] iteration 10612/ 11920 | consumed samples: 10866688 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809423E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:34.277638 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:51:09] iteration 10613/ 11920 | consumed samples: 10867712 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804654E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:02:34.209605 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:51:15] iteration 10614/ 11920 | consumed samples: 10868736 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800805E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:28.808877 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:51:20] iteration 10615/ 11920 | consumed samples: 10869760 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806199E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:17.573376 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:51:26] iteration 10616/ 11920 | consumed samples: 10870784 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805812E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:09.631727 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:51:32] iteration 10617/ 11920 | consumed samples: 10871808 | elapsed time per iteration (ms): 5933.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822482E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:51.514595 | finish at 2025-09-10 12:00:23 + [2025-09-10 09:51:37] iteration 10618/ 11920 | consumed samples: 10872832 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.797769E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:00.409249 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:51:43] iteration 10619/ 11920 | consumed samples: 10873856 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798601E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:50.386565 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:51:49] iteration 10620/ 11920 | consumed samples: 10874880 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801130E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:45.401039 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:51:54] iteration 10621/ 11920 | consumed samples: 10875904 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798112E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:44.611670 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:52:00] iteration 10622/ 11920 | consumed samples: 10876928 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802881E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:28.848407 | finish at 2025-09-10 11:53:29 + [2025-09-10 09:52:06] iteration 
10623/ 11920 | consumed samples: 10877952 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783223E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:33.261254 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:52:11] iteration 10624/ 11920 | consumed samples: 10878976 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804015E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:33.896370 | finish at 2025-09-10 11:53:45 + [2025-09-10 09:52:17] iteration 10625/ 11920 | consumed samples: 10880000 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788570E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:25.697076 | finish at 2025-09-10 11:53:42 + [2025-09-10 09:52:22] iteration 10626/ 11920 | consumed samples: 10881024 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807169E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:17.653234 | finish at 2025-09-10 11:53:40 + [2025-09-10 09:52:28] iteration 10627/ 11920 | consumed samples: 10882048 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812314E+00 | loss scale: 1.0 | grad norm: 0.198 | num 
zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:01.554198 | finish at 2025-09-10 11:53:30 + [2025-09-10 09:52:34] iteration 10628/ 11920 | consumed samples: 10883072 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802980E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:11.427161 | finish at 2025-09-10 11:53:45 + [2025-09-10 09:52:39] iteration 10629/ 11920 | consumed samples: 10884096 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802012E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:57.627690 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:52:45] iteration 10630/ 11920 | consumed samples: 10885120 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805332E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:51.573551 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:52:51] iteration 10631/ 11920 | consumed samples: 10886144 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810217E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:48.412899 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:52:56] iteration 10632/ 11920 | consumed samples: 10887168 | elapsed time per iteration (ms): 5625.6 | 
throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812575E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:45.787975 | finish at 2025-09-10 11:53:42 + [2025-09-10 09:53:02] iteration 10633/ 11920 | consumed samples: 10888192 | elapsed time per iteration (ms): 5977.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798617E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:13.552427 | finish at 2025-09-10 12:01:16 + [2025-09-10 09:53:08] iteration 10634/ 11920 | consumed samples: 10889216 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816954E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:27.195064 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:53:13] iteration 10635/ 11920 | consumed samples: 10890240 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810768E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:27.414533 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:53:19] iteration 10636/ 11920 | consumed samples: 10891264 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831233E+00 | loss scale: 1.0 | grad norm: 0.249 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 2:00:17.445497 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:53:25] iteration 10637/ 11920 | consumed samples: 10892288 | elapsed time per iteration (ms): 5993.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826566E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:08:09.783767 | finish at 2025-09-10 12:01:35 + [2025-09-10 09:53:31] iteration 10638/ 11920 | consumed samples: 10893312 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811441E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:10.306450 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:53:36] iteration 10639/ 11920 | consumed samples: 10894336 | elapsed time per iteration (ms): 5879.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824870E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:31.831188 | finish at 2025-09-10 11:59:08 + [2025-09-10 09:53:42] iteration 10640/ 11920 | consumed samples: 10895360 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815095E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:56.598511 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:53:48] iteration 10641/ 11920 | consumed samples: 10896384 | elapsed time per iteration (ms): 5850.1 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.811615E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:42.286490 | finish at 2025-09-10 11:58:30 + [2025-09-10 09:53:54] iteration 10642/ 11920 | consumed samples: 10897408 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803093E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:55.265073 | finish at 2025-09-10 11:53:49 + [2025-09-10 09:53:59] iteration 10643/ 11920 | consumed samples: 10898432 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813897E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:34.547129 | finish at 2025-09-10 11:53:34 + [2025-09-10 09:54:05] iteration 10644/ 11920 | consumed samples: 10899456 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804987E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:32.051686 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:54:10] iteration 10645/ 11920 | consumed samples: 10900480 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803496E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:39.782528 | finish at 2025-09-10 11:53:50 + [2025-09-10 09:54:16] iteration 
10646/ 11920 | consumed samples: 10901504 | elapsed time per iteration (ms): 5833.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810265E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:51.352797 | finish at 2025-09-10 11:58:08 + [2025-09-10 09:54:22] iteration 10647/ 11920 | consumed samples: 10902528 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787108E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:18.871944 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:54:28] iteration 10648/ 11920 | consumed samples: 10903552 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798537E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:10.582289 | finish at 2025-09-10 11:53:38 + [2025-09-10 09:54:33] iteration 10649/ 11920 | consumed samples: 10904576 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805495E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:14.117420 | finish at 2025-09-10 11:53:47 + [2025-09-10 09:54:39] iteration 10650/ 11920 | consumed samples: 10905600 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802299E+00 | loss scale: 1.0 | grad norm: 0.158 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:04.488206 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:54:44] iteration 10651/ 11920 | consumed samples: 10906624 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789847E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:48.896220 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:54:50] iteration 10652/ 11920 | consumed samples: 10907648 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804435E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:47.087961 | finish at 2025-09-10 11:53:37 + [2025-09-10 09:54:56] iteration 10653/ 11920 | consumed samples: 10908672 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803963E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:37.110986 | finish at 2025-09-10 11:53:33 + [2025-09-10 09:55:01] iteration 10654/ 11920 | consumed samples: 10909696 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814186E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:35.035761 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:55:07] iteration 10655/ 11920 | consumed samples: 10910720 | elapsed time per iteration (ms): 5619.0 | 
throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810587E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:28.027407 | finish at 2025-09-10 11:53:35 + [2025-09-10 09:55:13] iteration 10656/ 11920 | consumed samples: 10911744 | elapsed time per iteration (ms): 5910.9 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811700E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:31.404266 | finish at 2025-09-10 11:59:44 + [2025-09-10 09:55:18] iteration 10657/ 11920 | consumed samples: 10912768 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801204E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:17.464536 | finish at 2025-09-10 11:53:36 + [2025-09-10 09:55:24] iteration 10658/ 11920 | consumed samples: 10913792 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812699E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:14.582151 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:55:30] iteration 10659/ 11920 | consumed samples: 10914816 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793343E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 1:58:11.716174 | finish at 2025-09-10 11:53:41 + [2025-09-10 09:55:35] iteration 10660/ 11920 | consumed samples: 10915840 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784131E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:09.899054 | finish at 2025-09-10 11:53:45 + [2025-09-10 09:55:41] iteration 10661/ 11920 | consumed samples: 10916864 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826179E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:58.201532 | finish at 2025-09-10 11:53:39 + [2025-09-10 09:55:47] iteration 10662/ 11920 | consumed samples: 10917888 | elapsed time per iteration (ms): 5847.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800711E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:02:36.215094 | finish at 2025-09-10 11:58:23 + [2025-09-10 09:55:52] iteration 10663/ 11920 | consumed samples: 10918912 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806250E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:54.153875 | finish at 2025-09-10 11:53:47 + [2025-09-10 09:55:58] iteration 10664/ 11920 | consumed samples: 10919936 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 2.826446E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:45.089836 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:56:04] iteration 10665/ 11920 | consumed samples: 10920960 | elapsed time per iteration (ms): 5992.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808999E+00 | loss scale: 1.0 | grad norm: 0.280 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:05:20.950831 | finish at 2025-09-10 12:01:25 + [2025-09-10 09:56:10] iteration 10666/ 11920 | consumed samples: 10921984 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807302E+00 | loss scale: 1.0 | grad norm: 0.237 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:41.019922 | finish at 2025-09-10 11:53:51 + [2025-09-10 09:56:15] iteration 10667/ 11920 | consumed samples: 10923008 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792837E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:29.527278 | finish at 2025-09-10 11:53:45 + [2025-09-10 09:56:21] iteration 10668/ 11920 | consumed samples: 10924032 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793867E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:21.649573 | finish at 2025-09-10 11:53:43 + [2025-09-10 09:56:27] iteration 
10669/ 11920 | consumed samples: 10925056 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803426E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:21.940377 | finish at 2025-09-10 11:53:48 + [2025-09-10 09:56:32] iteration 10670/ 11920 | consumed samples: 10926080 | elapsed time per iteration (ms): 5934.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803260E+00 | loss scale: 1.0 | grad norm: 0.309 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:37.609692 | finish at 2025-09-10 12:00:10 + [2025-09-10 09:56:38] iteration 10671/ 11920 | consumed samples: 10927104 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825076E+00 | loss scale: 1.0 | grad norm: 0.546 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:13.884955 | finish at 2025-09-10 11:53:52 + [2025-09-10 09:56:44] iteration 10672/ 11920 | consumed samples: 10928128 | elapsed time per iteration (ms): 5953.2 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.852661E+00 | loss scale: 1.0 | grad norm: 0.422 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:49.630028 | finish at 2025-09-10 12:00:34 + [2025-09-10 09:56:50] iteration 10673/ 11920 | consumed samples: 10929152 | elapsed time per iteration (ms): 5971.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885418E+00 | loss scale: 1.0 | grad norm: 0.467 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:06.081630 | finish at 2025-09-10 12:00:56 + [2025-09-10 09:56:56] iteration 10674/ 11920 | consumed samples: 10930176 | elapsed time per iteration (ms): 5639.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.894763E+00 | loss scale: 1.0 | grad norm: 0.569 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:06.757168 | finish at 2025-09-10 11:54:02 + [2025-09-10 09:57:01] iteration 10675/ 11920 | consumed samples: 10931200 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875535E+00 | loss scale: 1.0 | grad norm: 0.468 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:56.687808 | finish at 2025-09-10 11:53:58 + [2025-09-10 09:57:07] iteration 10676/ 11920 | consumed samples: 10932224 | elapsed time per iteration (ms): 5649.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893192E+00 | loss scale: 1.0 | grad norm: 0.785 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:08.458051 | finish at 2025-09-10 11:54:15 + [2025-09-10 09:57:13] iteration 10677/ 11920 | consumed samples: 10933248 | elapsed time per iteration (ms): 5644.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.915496E+00 | loss scale: 1.0 | grad norm: 1.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:55.922076 | finish at 2025-09-10 11:54:09 + [2025-09-10 09:57:18] iteration 10678/ 11920 | consumed samples: 10934272 | elapsed time per iteration (ms): 5643.3 | 
throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953694E+00 | loss scale: 1.0 | grad norm: 1.298 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:48.987850 | finish at 2025-09-10 11:54:07 + [2025-09-10 09:57:24] iteration 10679/ 11920 | consumed samples: 10935296 | elapsed time per iteration (ms): 5688.2 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.289833E+00 | loss scale: 1.0 | grad norm: 9.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:38.995541 | finish at 2025-09-10 11:55:03 + [2025-09-10 09:57:30] iteration 10680/ 11920 | consumed samples: 10936320 | elapsed time per iteration (ms): 5682.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.320207E+00 | loss scale: 1.0 | grad norm: 7.834 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:26.786480 | finish at 2025-09-10 11:54:56 + [2025-09-10 09:57:35] iteration 10681/ 11920 | consumed samples: 10937344 | elapsed time per iteration (ms): 5758.4 | throughput per GPU (TFLOP/s/GPU): 78.4 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.855638E+00 | loss scale: 1.0 | grad norm: 14.059 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:54.683254 | finish at 2025-09-10 11:56:30 + [2025-09-10 09:57:41] iteration 10682/ 11920 | consumed samples: 10938368 | elapsed time per iteration (ms): 5754.0 | throughput per GPU (TFLOP/s/GPU): 78.5 | MFU 7.93% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.764907E+00 | loss scale: 1.0 | grad norm: 6.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 1:58:43.455183 | finish at 2025-09-10 11:56:25 + [2025-09-10 09:57:47] iteration 10683/ 11920 | consumed samples: 10939392 | elapsed time per iteration (ms): 5702.5 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.650914E+00 | loss scale: 1.0 | grad norm: 1.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:33.982013 | finish at 2025-09-10 11:55:21 + [2025-09-10 09:57:53] iteration 10684/ 11920 | consumed samples: 10940416 | elapsed time per iteration (ms): 5719.2 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.653840E+00 | loss scale: 1.0 | grad norm: 2.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:48.989714 | finish at 2025-09-10 11:55:42 + [2025-09-10 09:57:58] iteration 10685/ 11920 | consumed samples: 10941440 | elapsed time per iteration (ms): 5686.9 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.560044E+00 | loss scale: 1.0 | grad norm: 1.046 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:03.272204 | finish at 2025-09-10 11:55:01 + [2025-09-10 09:58:04] iteration 10686/ 11920 | consumed samples: 10942464 | elapsed time per iteration (ms): 5690.8 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.536522E+00 | loss scale: 1.0 | grad norm: 1.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:02.508041 | finish at 2025-09-10 11:55:06 + [2025-09-10 09:58:10] iteration 10687/ 11920 | consumed samples: 10943488 | elapsed time per iteration (ms): 5733.9 | throughput per GPU (TFLOP/s/GPU): 78.7 | MFU 7.96% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 4.110701E+00 | loss scale: 1.0 | grad norm: 9.579 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:49.907606 | finish at 2025-09-10 11:56:00 + [2025-09-10 09:58:15] iteration 10688/ 11920 | consumed samples: 10944512 | elapsed time per iteration (ms): 5709.1 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.687785E+00 | loss scale: 1.0 | grad norm: 2.455 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:13.584774 | finish at 2025-09-10 11:55:29 + [2025-09-10 09:58:21] iteration 10689/ 11920 | consumed samples: 10945536 | elapsed time per iteration (ms): 5923.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.589273E+00 | loss scale: 1.0 | grad norm: 1.696 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:31.899894 | finish at 2025-09-10 11:59:53 + [2025-09-10 09:58:27] iteration 10690/ 11920 | consumed samples: 10946560 | elapsed time per iteration (ms): 5709.1 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.539363E+00 | loss scale: 1.0 | grad norm: 1.585 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:02.144036 | finish at 2025-09-10 11:55:29 + [2025-09-10 09:58:33] iteration 10691/ 11920 | consumed samples: 10947584 | elapsed time per iteration (ms): 6079.3 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.51% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.551088E+00 | loss scale: 1.0 | grad norm: 1.885 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:04:31.467814 | finish at 2025-09-10 12:03:05 + [2025-09-10 09:58:39] iteration 
10692/ 11920 | consumed samples: 10948608 | elapsed time per iteration (ms): 5714.7 | throughput per GPU (TFLOP/s/GPU): 79.0 | MFU 7.99% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.500576E+00 | loss scale: 1.0 | grad norm: 1.641 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:57.623180 | finish at 2025-09-10 11:55:36 + [2025-09-10 09:58:44] iteration 10693/ 11920 | consumed samples: 10949632 | elapsed time per iteration (ms): 5719.9 | throughput per GPU (TFLOP/s/GPU): 78.9 | MFU 7.98% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.567236E+00 | loss scale: 1.0 | grad norm: 3.410 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:58.339987 | finish at 2025-09-10 11:55:43 + [2025-09-10 09:58:50] iteration 10694/ 11920 | consumed samples: 10950656 | elapsed time per iteration (ms): 5695.2 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.505969E+00 | loss scale: 1.0 | grad norm: 1.254 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:22.329774 | finish at 2025-09-10 11:55:13 + [2025-09-10 09:58:56] iteration 10695/ 11920 | consumed samples: 10951680 | elapsed time per iteration (ms): 5697.6 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.01% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.538136E+00 | loss scale: 1.0 | grad norm: 2.920 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:19.573882 | finish at 2025-09-10 11:55:15 + [2025-09-10 09:59:02] iteration 10696/ 11920 | consumed samples: 10952704 | elapsed time per iteration (ms): 5705.4 | throughput per GPU (TFLOP/s/GPU): 79.1 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.562722E+00 | loss scale: 1.0 | grad norm: 3.105 | num 
zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:23.467077 | finish at 2025-09-10 11:55:25 + [2025-09-10 09:59:07] iteration 10697/ 11920 | consumed samples: 10953728 | elapsed time per iteration (ms): 5673.6 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.418340E+00 | loss scale: 1.0 | grad norm: 0.530 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:55:38.784174 | finish at 2025-09-10 11:54:46 + [2025-09-10 09:59:13] iteration 10698/ 11920 | consumed samples: 10954752 | elapsed time per iteration (ms): 5685.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.459880E+00 | loss scale: 1.0 | grad norm: 1.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:55:48.029337 | finish at 2025-09-10 11:55:01 + [2025-09-10 09:59:19] iteration 10699/ 11920 | consumed samples: 10955776 | elapsed time per iteration (ms): 5704.1 | throughput per GPU (TFLOP/s/GPU): 79.2 | MFU 8.00% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.748656E+00 | loss scale: 1.0 | grad norm: 5.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:04.690541 | finish at 2025-09-10 11:55:23 + [2025-09-10 09:59:24] iteration 10700/ 11920 | consumed samples: 10956800 | elapsed time per iteration (ms): 5683.7 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.651574E+00 | loss scale: 1.0 | grad norm: 2.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:55:34.129810 | finish at 2025-09-10 11:54:58 + [2025-09-10 09:59:30] iteration 10701/ 11920 | consumed samples: 10957824 | elapsed time per iteration (ms): 5997.3 | 
throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.612998E+00 | loss scale: 1.0 | grad norm: 1.954 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:01:50.691733 | finish at 2025-09-10 12:01:21 + [2025-09-10 09:59:36] iteration 10702/ 11920 | consumed samples: 10958848 | elapsed time per iteration (ms): 5655.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.485311E+00 | loss scale: 1.0 | grad norm: 0.878 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:54:47.889861 | finish at 2025-09-10 11:54:24 + [2025-09-10 09:59:42] iteration 10703/ 11920 | consumed samples: 10959872 | elapsed time per iteration (ms): 5695.4 | throughput per GPU (TFLOP/s/GPU): 79.3 | MFU 8.02% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.604880E+00 | loss scale: 1.0 | grad norm: 2.738 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:55:31.348515 | finish at 2025-09-10 11:55:13 + [2025-09-10 09:59:47] iteration 10704/ 11920 | consumed samples: 10960896 | elapsed time per iteration (ms): 5676.9 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.04% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.519296E+00 | loss scale: 1.0 | grad norm: 1.756 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:55:03.051727 | finish at 2025-09-10 11:54:50 + [2025-09-10 09:59:53] iteration 10705/ 11920 | consumed samples: 10961920 | elapsed time per iteration (ms): 5740.6 | throughput per GPU (TFLOP/s/GPU): 78.6 | MFU 7.95% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 4.096283E+00 | loss scale: 1.0 | grad norm: 11.698 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 1:56:14.805089 | finish at 2025-09-10 11:56:08 + [2025-09-10 09:59:59] iteration 10706/ 11920 | consumed samples: 10962944 | elapsed time per iteration (ms): 5884.4 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.546764E+00 | loss scale: 1.0 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:03.680571 | finish at 2025-09-10 11:59:03 + [2025-09-10 10:00:05] iteration 10707/ 11920 | consumed samples: 10963968 | elapsed time per iteration (ms): 5685.8 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.481507E+00 | loss scale: 1.0 | grad norm: 1.051 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:54:56.826617 | finish at 2025-09-10 11:55:02 + [2025-09-10 10:00:10] iteration 10708/ 11920 | consumed samples: 10964992 | elapsed time per iteration (ms): 5688.3 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.442437E+00 | loss scale: 1.0 | grad norm: 1.174 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:54:54.276398 | finish at 2025-09-10 11:55:05 + [2025-09-10 10:00:16] iteration 10709/ 11920 | consumed samples: 10966016 | elapsed time per iteration (ms): 5930.7 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.673197E+00 | loss scale: 1.0 | grad norm: 4.755 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:59:42.073735 | finish at 2025-09-10 11:59:58 + [2025-09-10 10:00:22] iteration 10710/ 11920 | consumed samples: 10967040 | elapsed time per iteration (ms): 5875.0 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global 
batch size: 1024 | lm loss: 3.469592E+00 | loss scale: 1.0 | grad norm: 1.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:28.693745 | finish at 2025-09-10 11:58:51 + [2025-09-10 10:00:28] iteration 10711/ 11920 | consumed samples: 10968064 | elapsed time per iteration (ms): 5666.8 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.413272E+00 | loss scale: 1.0 | grad norm: 0.840 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:54:11.143355 | finish at 2025-09-10 11:54:39 + [2025-09-10 10:00:34] iteration 10712/ 11920 | consumed samples: 10969088 | elapsed time per iteration (ms): 5669.4 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.361755E+00 | loss scale: 1.0 | grad norm: 0.648 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:54:08.665413 | finish at 2025-09-10 11:54:42 + [2025-09-10 10:00:39] iteration 10713/ 11920 | consumed samples: 10970112 | elapsed time per iteration (ms): 5651.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.337038E+00 | loss scale: 1.0 | grad norm: 0.983 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:41.249404 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:00:45] iteration 10714/ 11920 | consumed samples: 10971136 | elapsed time per iteration (ms): 5657.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.345994E+00 | loss scale: 1.0 | grad norm: 1.375 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:43.062060 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:00:51] iteration 
10715/ 11920 | consumed samples: 10972160 | elapsed time per iteration (ms): 6006.2 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.328564E+00 | loss scale: 1.0 | grad norm: 1.793 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:37.431444 | finish at 2025-09-10 12:01:28 + [2025-09-10 10:00:56] iteration 10716/ 11920 | consumed samples: 10973184 | elapsed time per iteration (ms): 5659.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.328307E+00 | loss scale: 1.0 | grad norm: 1.558 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:33.901795 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:01:02] iteration 10717/ 11920 | consumed samples: 10974208 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.304597E+00 | loss scale: 1.0 | grad norm: 0.892 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:12.468877 | finish at 2025-09-10 11:54:15 + [2025-09-10 10:01:08] iteration 10718/ 11920 | consumed samples: 10975232 | elapsed time per iteration (ms): 5654.1 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.252995E+00 | loss scale: 1.0 | grad norm: 0.547 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:16.278567 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:01:13] iteration 10719/ 11920 | consumed samples: 10976256 | elapsed time per iteration (ms): 5657.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.221920E+00 | loss scale: 1.0 | grad norm: 0.551 | num 
zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:14.122650 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:01:19] iteration 10720/ 11920 | consumed samples: 10977280 | elapsed time per iteration (ms): 6002.4 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.214501E+00 | loss scale: 1.0 | grad norm: 1.393 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:00:02.863598 | finish at 2025-09-10 12:01:22 + [2025-09-10 10:01:25] iteration 10721/ 11920 | consumed samples: 10978304 | elapsed time per iteration (ms): 5649.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.210815E+00 | loss scale: 1.0 | grad norm: 1.118 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:53.518822 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:01:31] iteration 10722/ 11920 | consumed samples: 10979328 | elapsed time per iteration (ms): 5657.8 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.184042E+00 | loss scale: 1.0 | grad norm: 0.644 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:58.003792 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:01:36] iteration 10723/ 11920 | consumed samples: 10980352 | elapsed time per iteration (ms): 5659.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.175382E+00 | loss scale: 1.0 | grad norm: 1.064 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:54.462456 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:01:42] iteration 10724/ 11920 | consumed samples: 10981376 | elapsed time per iteration (ms): 5647.4 | 
throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.164422E+00 | loss scale: 1.0 | grad norm: 0.797 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:34.252073 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:01:48] iteration 10725/ 11920 | consumed samples: 10982400 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.135825E+00 | loss scale: 1.0 | grad norm: 0.813 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:36.510109 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:01:54] iteration 10726/ 11920 | consumed samples: 10983424 | elapsed time per iteration (ms): 5908.4 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.153204E+00 | loss scale: 1.0 | grad norm: 1.556 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:34.619707 | finish at 2025-09-10 11:59:28 + [2025-09-10 10:02:00] iteration 10727/ 11920 | consumed samples: 10984448 | elapsed time per iteration (ms): 5902.2 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.125942E+00 | loss scale: 1.0 | grad norm: 1.111 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:21.329985 | finish at 2025-09-10 11:59:21 + [2025-09-10 10:02:05] iteration 10728/ 11920 | consumed samples: 10985472 | elapsed time per iteration (ms): 5640.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.132293E+00 | loss scale: 1.0 | grad norm: 0.779 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining 
time: 1:52:03.248726 | finish at 2025-09-10 11:54:08 +(min, max) time across ranks (ms): + save-checkpoint ................................: (4239.25, 4239.30) + [2025-09-10 10:02:15] iteration 10729/ 11920 | consumed samples: 10986496 | elapsed time per iteration (ms): 5862.3 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.103644E+00 | loss scale: 1.0 | grad norm: 0.536 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:56:22.005167 | finish at 2025-09-10 11:58:37 + [2025-09-10 10:02:21] iteration 10730/ 11920 | consumed samples: 10987520 | elapsed time per iteration (ms): 5952.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076963E+00 | loss scale: 1.0 | grad norm: 0.732 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:03.931344 | finish at 2025-09-10 12:00:25 + [2025-09-10 10:02:27] iteration 10731/ 11920 | consumed samples: 10988544 | elapsed time per iteration (ms): 5646.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.076762E+00 | loss scale: 1.0 | grad norm: 0.689 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:53.307635 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:02:33] iteration 10732/ 11920 | consumed samples: 10989568 | elapsed time per iteration (ms): 5648.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.075038E+00 | loss scale: 1.0 | grad norm: 0.933 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:50.045780 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:02:38] iteration 10733/ 11920 | consumed samples: 10990592 | elapsed time per 
iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.083747E+00 | loss scale: 1.0 | grad norm: 1.515 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:37.666088 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:02:44] iteration 10734/ 11920 | consumed samples: 10991616 | elapsed time per iteration (ms): 5647.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.105521E+00 | loss scale: 1.0 | grad norm: 1.874 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:37.664157 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:02:49] iteration 10735/ 11920 | consumed samples: 10992640 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.060622E+00 | loss scale: 1.0 | grad norm: 0.588 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:15.120134 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:02:55] iteration 10736/ 11920 | consumed samples: 10993664 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.063171E+00 | loss scale: 1.0 | grad norm: 0.929 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:16.271622 | finish at 2025-09-10 11:54:11 + [2025-09-10 10:03:01] iteration 10737/ 11920 | consumed samples: 10994688 | elapsed time per iteration (ms): 5972.8 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.048404E+00 | loss scale: 1.0 | grad norm: 0.741 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:57:45.858804 | finish at 2025-09-10 12:00:47 + [2025-09-10 10:03:07] iteration 10738/ 11920 | consumed samples: 10995712 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.035635E+00 | loss scale: 1.0 | grad norm: 0.709 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:00.825591 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:03:12] iteration 10739/ 11920 | consumed samples: 10996736 | elapsed time per iteration (ms): 5640.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025483E+00 | loss scale: 1.0 | grad norm: 0.596 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:00.928819 | finish at 2025-09-10 11:54:13 + [2025-09-10 10:03:18] iteration 10740/ 11920 | consumed samples: 10997760 | elapsed time per iteration (ms): 5648.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.025218E+00 | loss scale: 1.0 | grad norm: 0.672 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:05.288758 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:03:24] iteration 10741/ 11920 | consumed samples: 10998784 | elapsed time per iteration (ms): 6004.4 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982543E+00 | loss scale: 1.0 | grad norm: 0.455 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:57:59.196894 | finish at 2025-09-10 12:01:23 + [2025-09-10 10:03:30] iteration 10742/ 11920 | consumed samples: 10999808 | elapsed time per iteration (ms): 5646.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.999955E+00 | loss scale: 1.0 | grad norm: 1.016 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:51.261166 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:03:35] iteration 10743/ 11920 | consumed samples: 11000832 | elapsed time per iteration (ms): 5642.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.977659E+00 | loss scale: 1.0 | grad norm: 0.408 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:40.926919 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:03:41] iteration 10744/ 11920 | consumed samples: 11001856 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.982256E+00 | loss scale: 1.0 | grad norm: 0.447 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:23.993477 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:03:47] iteration 10745/ 11920 | consumed samples: 11002880 | elapsed time per iteration (ms): 6309.3 | throughput per GPU (TFLOP/s/GPU): 71.6 | MFU 7.24% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985193E+00 | loss scale: 1.0 | grad norm: 0.711 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 2:03:33.459933 | finish at 2025-09-10 12:07:21 + [2025-09-10 10:03:53] iteration 10746/ 11920 | consumed samples: 11003904 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.970473E+00 | loss scale: 1.0 | grad norm: 0.685 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:13.783138 | finish at 2025-09-10 11:54:07 + 
[2025-09-10 10:03:58] iteration 10747/ 11920 | consumed samples: 11004928 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964944E+00 | loss scale: 1.0 | grad norm: 0.650 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:13.984242 | finish at 2025-09-10 11:54:12 + [2025-09-10 10:04:04] iteration 10748/ 11920 | consumed samples: 11005952 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.964030E+00 | loss scale: 1.0 | grad norm: 0.708 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:05.075034 | finish at 2025-09-10 11:54:09 + [2025-09-10 10:04:10] iteration 10749/ 11920 | consumed samples: 11006976 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.958322E+00 | loss scale: 1.0 | grad norm: 0.693 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:02.454816 | finish at 2025-09-10 11:54:12 + [2025-09-10 10:04:15] iteration 10750/ 11920 | consumed samples: 11008000 | elapsed time per iteration (ms): 5644.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.966117E+00 | loss scale: 1.0 | grad norm: 1.049 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:04.326396 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:04:21] iteration 10751/ 11920 | consumed samples: 11009024 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953372E+00 | loss scale: 
1.0 | grad norm: 0.282 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:49.303037 | finish at 2025-09-10 11:54:10 + [2025-09-10 10:04:27] iteration 10752/ 11920 | consumed samples: 11010048 | elapsed time per iteration (ms): 5634.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.951821E+00 | loss scale: 1.0 | grad norm: 0.534 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:41.045906 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:04:32] iteration 10753/ 11920 | consumed samples: 11011072 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.946169E+00 | loss scale: 1.0 | grad norm: 0.567 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:33.843319 | finish at 2025-09-10 11:54:06 + [2025-09-10 10:04:38] iteration 10754/ 11920 | consumed samples: 11012096 | elapsed time per iteration (ms): 5640.8 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.931784E+00 | loss scale: 1.0 | grad norm: 0.437 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:37.218112 | finish at 2025-09-10 11:54:15 + [2025-09-10 10:04:44] iteration 10755/ 11920 | consumed samples: 11013120 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.928216E+00 | loss scale: 1.0 | grad norm: 0.574 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:19.803683 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:04:49] iteration 10756/ 11920 | consumed samples: 11014144 | elapsed time per 
iteration (ms): 5640.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932517E+00 | loss scale: 1.0 | grad norm: 0.580 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:25.826537 | finish at 2025-09-10 11:54:15 + [2025-09-10 10:04:55] iteration 10757/ 11920 | consumed samples: 11015168 | elapsed time per iteration (ms): 5639.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.919781E+00 | loss scale: 1.0 | grad norm: 0.668 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:18.937197 | finish at 2025-09-10 11:54:14 + [2025-09-10 10:05:00] iteration 10758/ 11920 | consumed samples: 11016192 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.933050E+00 | loss scale: 1.0 | grad norm: 0.700 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:06.823877 | finish at 2025-09-10 11:54:07 + [2025-09-10 10:05:06] iteration 10759/ 11920 | consumed samples: 11017216 | elapsed time per iteration (ms): 5630.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910682E+00 | loss scale: 1.0 | grad norm: 0.477 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:57.054049 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:05:12] iteration 10760/ 11920 | consumed samples: 11018240 | elapsed time per iteration (ms): 5655.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921581E+00 | loss scale: 1.0 | grad norm: 0.939 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:49:19.921379 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:05:17] iteration 10761/ 11920 | consumed samples: 11019264 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.925149E+00 | loss scale: 1.0 | grad norm: 0.371 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:50.795048 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:05:23] iteration 10762/ 11920 | consumed samples: 11020288 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906873E+00 | loss scale: 1.0 | grad norm: 0.534 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:35.629888 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:05:29] iteration 10763/ 11920 | consumed samples: 11021312 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.918913E+00 | loss scale: 1.0 | grad norm: 0.792 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:36.155830 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:05:34] iteration 10764/ 11920 | consumed samples: 11022336 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.907614E+00 | loss scale: 1.0 | grad norm: 0.365 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:34.885722 | finish at 2025-09-10 11:54:09 + [2025-09-10 10:05:40] iteration 10765/ 11920 | consumed samples: 11023360 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892461E+00 | loss scale: 1.0 | grad norm: 0.489 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:22.157214 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:05:46] iteration 10766/ 11920 | consumed samples: 11024384 | elapsed time per iteration (ms): 5863.5 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.906929E+00 | loss scale: 1.0 | grad norm: 0.755 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:46.449697 | finish at 2025-09-10 11:58:32 + [2025-09-10 10:05:51] iteration 10767/ 11920 | consumed samples: 11025408 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.913477E+00 | loss scale: 1.0 | grad norm: 0.497 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:14.611920 | finish at 2025-09-10 11:54:06 + [2025-09-10 10:05:57] iteration 10768/ 11920 | consumed samples: 11026432 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897708E+00 | loss scale: 1.0 | grad norm: 0.557 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:05.000702 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:06:03] iteration 10769/ 11920 | consumed samples: 11027456 | elapsed time per iteration (ms): 5855.2 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896105E+00 | loss scale: 1.0 | grad norm: 0.743 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:19.352000 | finish at 2025-09-10 11:58:22 + 
[2025-09-10 10:06:09] iteration 10770/ 11920 | consumed samples: 11028480 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.902068E+00 | loss scale: 1.0 | grad norm: 0.358 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:58.786957 | finish at 2025-09-10 11:54:07 + [2025-09-10 10:06:14] iteration 10771/ 11920 | consumed samples: 11029504 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890378E+00 | loss scale: 1.0 | grad norm: 0.273 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:47.854899 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:06:20] iteration 10772/ 11920 | consumed samples: 11030528 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897202E+00 | loss scale: 1.0 | grad norm: 0.390 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:42.722282 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:06:25] iteration 10773/ 11920 | consumed samples: 11031552 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.892698E+00 | loss scale: 1.0 | grad norm: 0.730 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:31.086050 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:06:31] iteration 10774/ 11920 | consumed samples: 11032576 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888303E+00 | loss scale: 
1.0 | grad norm: 0.894 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:26.530332 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:06:37] iteration 10775/ 11920 | consumed samples: 11033600 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.897207E+00 | loss scale: 1.0 | grad norm: 0.695 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:23.712782 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:06:42] iteration 10776/ 11920 | consumed samples: 11034624 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880715E+00 | loss scale: 1.0 | grad norm: 0.520 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:13.608698 | finish at 2025-09-10 11:53:56 + [2025-09-10 10:06:48] iteration 10777/ 11920 | consumed samples: 11035648 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.895504E+00 | loss scale: 1.0 | grad norm: 0.543 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:10.290642 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:06:54] iteration 10778/ 11920 | consumed samples: 11036672 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888739E+00 | loss scale: 1.0 | grad norm: 0.504 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:00.177493 | finish at 2025-09-10 11:53:54 + [2025-09-10 10:06:59] iteration 10779/ 11920 | consumed samples: 11037696 | elapsed time per 
iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.874065E+00 | loss scale: 1.0 | grad norm: 0.384 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:46:55.962045 | finish at 2025-09-10 11:53:55 + [2025-09-10 10:07:05] iteration 10780/ 11920 | consumed samples: 11038720 | elapsed time per iteration (ms): 5880.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865905E+00 | loss scale: 1.0 | grad norm: 0.298 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:43.814120 | finish at 2025-09-10 11:58:49 + [2025-09-10 10:07:11] iteration 10781/ 11920 | consumed samples: 11039744 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864722E+00 | loss scale: 1.0 | grad norm: 0.298 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:46:40.956377 | finish at 2025-09-10 11:53:52 + [2025-09-10 10:07:17] iteration 10782/ 11920 | consumed samples: 11040768 | elapsed time per iteration (ms): 5965.8 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861550E+00 | loss scale: 1.0 | grad norm: 0.294 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:09.050066 | finish at 2025-09-10 12:00:26 + [2025-09-10 10:07:22] iteration 10783/ 11920 | consumed samples: 11041792 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867272E+00 | loss scale: 1.0 | grad norm: 0.350 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:46:35.462893 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:07:28] iteration 10784/ 11920 | consumed samples: 11042816 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861147E+00 | loss scale: 1.0 | grad norm: 0.691 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:46:27.390423 | finish at 2025-09-10 11:53:55 + [2025-09-10 10:07:34] iteration 10785/ 11920 | consumed samples: 11043840 | elapsed time per iteration (ms): 5945.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871235E+00 | loss scale: 1.0 | grad norm: 0.453 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:27.615515 | finish at 2025-09-10 12:00:01 + [2025-09-10 10:07:39] iteration 10786/ 11920 | consumed samples: 11044864 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.871749E+00 | loss scale: 1.0 | grad norm: 0.442 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:46:11.623675 | finish at 2025-09-10 11:53:51 + [2025-09-10 10:07:45] iteration 10787/ 11920 | consumed samples: 11045888 | elapsed time per iteration (ms): 5890.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867998E+00 | loss scale: 1.0 | grad norm: 0.483 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:13.617286 | finish at 2025-09-10 11:58:59 + [2025-09-10 10:07:51] iteration 10788/ 11920 | consumed samples: 11046912 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867641E+00 | loss scale: 1.0 | grad norm: 0.415 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:46:07.560995 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:07:57] iteration 10789/ 11920 | consumed samples: 11047936 | elapsed time per iteration (ms): 5839.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.875186E+00 | loss scale: 1.0 | grad norm: 0.391 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:04.538621 | finish at 2025-09-10 11:58:01 + [2025-09-10 10:08:03] iteration 10790/ 11920 | consumed samples: 11048960 | elapsed time per iteration (ms): 5858.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865375E+00 | loss scale: 1.0 | grad norm: 0.481 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:19.912374 | finish at 2025-09-10 11:58:23 + [2025-09-10 10:08:09] iteration 10791/ 11920 | consumed samples: 11049984 | elapsed time per iteration (ms): 5980.1 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863937E+00 | loss scale: 1.0 | grad norm: 0.411 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:52:31.506696 | finish at 2025-09-10 12:00:40 + [2025-09-10 10:08:14] iteration 10792/ 11920 | consumed samples: 11051008 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.861091E+00 | loss scale: 1.0 | grad norm: 0.324 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:45.637648 | finish at 2025-09-10 11:54:00 + 
[2025-09-10 10:08:20] iteration 10793/ 11920 | consumed samples: 11052032 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867887E+00 | loss scale: 1.0 | grad norm: 0.337 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:39.609842 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:08:26] iteration 10794/ 11920 | consumed samples: 11053056 | elapsed time per iteration (ms): 5631.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841607E+00 | loss scale: 1.0 | grad norm: 0.444 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:40.618263 | finish at 2025-09-10 11:54:06 + [2025-09-10 10:08:31] iteration 10795/ 11920 | consumed samples: 11054080 | elapsed time per iteration (ms): 5638.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866782E+00 | loss scale: 1.0 | grad norm: 0.453 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:43.384892 | finish at 2025-09-10 11:54:15 + [2025-09-10 10:08:37] iteration 10796/ 11920 | consumed samples: 11055104 | elapsed time per iteration (ms): 5644.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859616E+00 | loss scale: 1.0 | grad norm: 0.494 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:43.800320 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:08:42] iteration 10797/ 11920 | consumed samples: 11056128 | elapsed time per iteration (ms): 5635.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.873967E+00 | loss scale: 
1.0 | grad norm: 0.740 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:28.711162 | finish at 2025-09-10 11:54:11 + [2025-09-10 10:08:48] iteration 10798/ 11920 | consumed samples: 11057152 | elapsed time per iteration (ms): 5939.3 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.869151E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:51:03.926768 | finish at 2025-09-10 11:59:52 + [2025-09-10 10:08:54] iteration 10799/ 11920 | consumed samples: 11058176 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849326E+00 | loss scale: 1.0 | grad norm: 0.356 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:09.922924 | finish at 2025-09-10 11:54:04 + [2025-09-10 10:09:00] iteration 10800/ 11920 | consumed samples: 11059200 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.864462E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:03.456688 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:09:05] iteration 10801/ 11920 | consumed samples: 11060224 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.859923E+00 | loss scale: 1.0 | grad norm: 0.296 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:53.010634 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:09:11] iteration 10802/ 11920 | consumed samples: 11061248 | elapsed time per 
iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.863453E+00 | loss scale: 1.0 | grad norm: 0.386 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:45.847782 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:09:17] iteration 10803/ 11920 | consumed samples: 11062272 | elapsed time per iteration (ms): 5888.7 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857611E+00 | loss scale: 1.0 | grad norm: 0.474 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:37.705872 | finish at 2025-09-10 11:58:55 + [2025-09-10 10:09:22] iteration 10804/ 11920 | consumed samples: 11063296 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.856663E+00 | loss scale: 1.0 | grad norm: 0.398 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:39.777869 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:09:28] iteration 10805/ 11920 | consumed samples: 11064320 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.858005E+00 | loss scale: 1.0 | grad norm: 0.317 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:33.765631 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:09:34] iteration 10806/ 11920 | consumed samples: 11065344 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.855082E+00 | loss scale: 1.0 | grad norm: 0.305 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:44:32.132737 | finish at 2025-09-10 11:54:06 + [2025-09-10 10:09:39] iteration 10807/ 11920 | consumed samples: 11066368 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829040E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:23.146715 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:09:45] iteration 10808/ 11920 | consumed samples: 11067392 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849103E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:10.009354 | finish at 2025-09-10 11:53:55 + [2025-09-10 10:09:51] iteration 10809/ 11920 | consumed samples: 11068416 | elapsed time per iteration (ms): 5861.1 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.850610E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:48:31.651592 | finish at 2025-09-10 11:58:22 + [2025-09-10 10:09:56] iteration 10810/ 11920 | consumed samples: 11069440 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843455E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:00.556798 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:10:02] iteration 10811/ 11920 | consumed samples: 11070464 | elapsed time per iteration (ms): 5986.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842039E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:50:39.057083 | finish at 2025-09-10 12:00:41 + [2025-09-10 10:10:09] iteration 10812/ 11920 | consumed samples: 11071488 | elapsed time per iteration (ms): 6394.5 | throughput per GPU (TFLOP/s/GPU): 70.6 | MFU 7.14% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.857153E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:58:05.057148 | finish at 2025-09-10 12:08:14 + [2025-09-10 10:10:14] iteration 10813/ 11920 | consumed samples: 11072512 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.840358E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:41.628084 | finish at 2025-09-10 11:53:56 + [2025-09-10 10:10:20] iteration 10814/ 11920 | consumed samples: 11073536 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839953E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:35.291903 | finish at 2025-09-10 11:53:55 + [2025-09-10 10:10:26] iteration 10815/ 11920 | consumed samples: 11074560 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.830454E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:27.606295 | finish at 2025-09-10 11:53:53 + 
[2025-09-10 10:10:31] iteration 10816/ 11920 | consumed samples: 11075584 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841508E+00 | loss scale: 1.0 | grad norm: 0.323 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:26.517151 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:10:37] iteration 10817/ 11920 | consumed samples: 11076608 | elapsed time per iteration (ms): 5968.0 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823587E+00 | loss scale: 1.0 | grad norm: 0.275 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:42.712081 | finish at 2025-09-10 12:00:20 + [2025-09-10 10:10:43] iteration 10818/ 11920 | consumed samples: 11077632 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814805E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:17.812291 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:10:48] iteration 10819/ 11920 | consumed samples: 11078656 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843511E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:08.637057 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:10:55] iteration 10820/ 11920 | consumed samples: 11079680 | elapsed time per iteration (ms): 6180.4 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836973E+00 | loss scale: 
1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:53:18.493385 | finish at 2025-09-10 12:04:13 + [2025-09-10 10:11:00] iteration 10821/ 11920 | consumed samples: 11080704 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847336E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:00.775294 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:11:06] iteration 10822/ 11920 | consumed samples: 11081728 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837185E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:46.450655 | finish at 2025-09-10 11:53:52 + [2025-09-10 10:11:12] iteration 10823/ 11920 | consumed samples: 11082752 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.844460E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:42.982650 | finish at 2025-09-10 11:53:55 + [2025-09-10 10:11:17] iteration 10824/ 11920 | consumed samples: 11083776 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827068E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:43.212923 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:11:23] iteration 10825/ 11920 | consumed samples: 11084800 | elapsed time per 
iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832840E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:35.355592 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:11:28] iteration 10826/ 11920 | consumed samples: 11085824 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825762E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:32.171457 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:11:34] iteration 10827/ 11920 | consumed samples: 11086848 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827063E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:31.408453 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:11:40] iteration 10828/ 11920 | consumed samples: 11087872 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837579E+00 | loss scale: 1.0 | grad norm: 0.338 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:25.555243 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:11:45] iteration 10829/ 11920 | consumed samples: 11088896 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829859E+00 | loss scale: 1.0 | grad norm: 0.318 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:42:17.148120 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:11:51] iteration 10830/ 11920 | consumed samples: 11089920 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833350E+00 | loss scale: 1.0 | grad norm: 0.303 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:09.346926 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:11:57] iteration 10831/ 11920 | consumed samples: 11090944 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827472E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:59.716941 | finish at 2025-09-10 11:53:56 + [2025-09-10 10:12:02] iteration 10832/ 11920 | consumed samples: 11091968 | elapsed time per iteration (ms): 5895.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836351E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:46:54.380203 | finish at 2025-09-10 11:58:57 + [2025-09-10 10:12:08] iteration 10833/ 11920 | consumed samples: 11092992 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822599E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:51.108535 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:12:14] iteration 10834/ 11920 | consumed samples: 11094016 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820757E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:46.255281 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:12:19] iteration 10835/ 11920 | consumed samples: 11095040 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825868E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:40.552386 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:12:25] iteration 10836/ 11920 | consumed samples: 11096064 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837774E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:29.340610 | finish at 2025-09-10 11:53:54 + [2025-09-10 10:12:31] iteration 10837/ 11920 | consumed samples: 11097088 | elapsed time per iteration (ms): 5965.7 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828185E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:40.800329 | finish at 2025-09-10 12:00:12 + [2025-09-10 10:12:36] iteration 10838/ 11920 | consumed samples: 11098112 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819923E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:32.341936 | finish at 2025-09-10 11:54:09 + 
[2025-09-10 10:12:42] iteration 10839/ 11920 | consumed samples: 11099136 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825069E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:15.161371 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:12:48] iteration 10840/ 11920 | consumed samples: 11100160 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820068E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:07.380295 | finish at 2025-09-10 11:53:55 + [2025-09-10 10:12:53] iteration 10841/ 11920 | consumed samples: 11101184 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817765E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:06.070577 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:12:59] iteration 10842/ 11920 | consumed samples: 11102208 | elapsed time per iteration (ms): 6113.9 | throughput per GPU (TFLOP/s/GPU): 73.8 | MFU 7.47% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833338E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:49:50.783628 | finish at 2025-09-10 12:02:50 + [2025-09-10 10:13:05] iteration 10843/ 11920 | consumed samples: 11103232 | elapsed time per iteration (ms): 5613.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817066E+00 | loss scale: 
1.0 | grad norm: 0.263 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:45.443050 | finish at 2025-09-10 11:53:51 + [2025-09-10 10:13:11] iteration 10844/ 11920 | consumed samples: 11104256 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829329E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:45.977768 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:13:16] iteration 10845/ 11920 | consumed samples: 11105280 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829250E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:45.384616 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:13:22] iteration 10846/ 11920 | consumed samples: 11106304 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816660E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:39.633996 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:13:28] iteration 10847/ 11920 | consumed samples: 11107328 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824334E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:29.279819 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:13:33] iteration 10848/ 11920 | consumed samples: 11108352 | elapsed time per 
iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835819E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:28.115574 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:13:39] iteration 10849/ 11920 | consumed samples: 11109376 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806369E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:22.198173 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:13:44] iteration 10850/ 11920 | consumed samples: 11110400 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821300E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:09.903624 | finish at 2025-09-10 11:53:54 + [2025-09-10 10:13:50] iteration 10851/ 11920 | consumed samples: 11111424 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818496E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:09.322348 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:13:56] iteration 10852/ 11920 | consumed samples: 11112448 | elapsed time per iteration (ms): 5968.5 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839161E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:46:14.327846 | finish at 2025-09-10 12:00:10 + [2025-09-10 10:14:02] iteration 10853/ 11920 | consumed samples: 11113472 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833327E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:40:04.358381 | finish at 2025-09-10 11:54:06 + [2025-09-10 10:14:07] iteration 10854/ 11920 | consumed samples: 11114496 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822767E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:54.988124 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:14:13] iteration 10855/ 11920 | consumed samples: 11115520 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811513E+00 | loss scale: 1.0 | grad norm: 0.121 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:43.779938 | finish at 2025-09-10 11:53:57 + [2025-09-10 10:14:19] iteration 10856/ 11920 | consumed samples: 11116544 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832116E+00 | loss scale: 1.0 | grad norm: 0.111 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:39.447002 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:14:24] iteration 10857/ 11920 | consumed samples: 11117568 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831115E+00 | loss scale: 1.0 | grad norm: 0.093 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:33.633594 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:14:30] iteration 10858/ 11920 | consumed samples: 11118592 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804119E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:45.783608 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:14:35] iteration 10859/ 11920 | consumed samples: 11119616 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825634E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:33.132132 | finish at 2025-09-10 11:54:09 + [2025-09-10 10:14:41] iteration 10860/ 11920 | consumed samples: 11120640 | elapsed time per iteration (ms): 5856.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810757E+00 | loss scale: 1.0 | grad norm: 0.110 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:27.725887 | finish at 2025-09-10 11:58:09 + [2025-09-10 10:14:47] iteration 10861/ 11920 | consumed samples: 11121664 | elapsed time per iteration (ms): 5618.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.842371E+00 | loss scale: 1.0 | grad norm: 0.098 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:10.139198 | finish at 2025-09-10 11:53:57 + 
[2025-09-10 10:14:52] iteration 10862/ 11920 | consumed samples: 11122688 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819293E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:00.801935 | finish at 2025-09-10 11:53:53 + [2025-09-10 10:14:58] iteration 10863/ 11920 | consumed samples: 11123712 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.836904E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:58.052146 | finish at 2025-09-10 11:53:56 + [2025-09-10 10:15:04] iteration 10864/ 11920 | consumed samples: 11124736 | elapsed time per iteration (ms): 5617.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817379E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:52.037270 | finish at 2025-09-10 11:53:56 + [2025-09-10 10:15:09] iteration 10865/ 11920 | consumed samples: 11125760 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808181E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:52.835124 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:15:15] iteration 10866/ 11920 | consumed samples: 11126784 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812203E+00 | loss scale: 
1.0 | grad norm: 0.227 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:46.361961 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:15:21] iteration 10867/ 11920 | consumed samples: 11127808 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813490E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:42.202123 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:15:27] iteration 10868/ 11920 | consumed samples: 11128832 | elapsed time per iteration (ms): 5956.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815353E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:25.702312 | finish at 2025-09-10 11:59:52 + [2025-09-10 10:15:32] iteration 10869/ 11920 | consumed samples: 11129856 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824983E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:29.986895 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:15:38] iteration 10870/ 11920 | consumed samples: 11130880 | elapsed time per iteration (ms): 5994.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808698E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:44:54.185615 | finish at 2025-09-10 12:00:32 + [2025-09-10 10:15:44] iteration 10871/ 11920 | consumed samples: 11131904 | elapsed time per 
iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819648E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:24.368513 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:15:49] iteration 10872/ 11920 | consumed samples: 11132928 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821659E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:19.684675 | finish at 2025-09-10 11:54:09 + [2025-09-10 10:15:55] iteration 10873/ 11920 | consumed samples: 11133952 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813463E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:02.881275 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:16:01] iteration 10874/ 11920 | consumed samples: 11134976 | elapsed time per iteration (ms): 6045.1 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800080E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:45:23.220459 | finish at 2025-09-10 12:01:24 + [2025-09-10 10:16:07] iteration 10875/ 11920 | consumed samples: 11136000 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827896E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:37:50.937097 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:16:13] iteration 10876/ 11920 | consumed samples: 11137024 | elapsed time per iteration (ms): 5855.2 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808198E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:52.784051 | finish at 2025-09-10 11:58:05 + [2025-09-10 10:16:18] iteration 10877/ 11920 | consumed samples: 11138048 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813320E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:42.368840 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:16:24] iteration 10878/ 11920 | consumed samples: 11139072 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822299E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:37.791079 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:16:29] iteration 10879/ 11920 | consumed samples: 11140096 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821330E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:34.277060 | finish at 2025-09-10 11:54:04 + [2025-09-10 10:16:35] iteration 10880/ 11920 | consumed samples: 11141120 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816277E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:36.098213 | finish at 2025-09-10 11:54:11 + [2025-09-10 10:16:41] iteration 10881/ 11920 | consumed samples: 11142144 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813251E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:19.248251 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:16:47] iteration 10882/ 11920 | consumed samples: 11143168 | elapsed time per iteration (ms): 5849.0 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809556E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:11.213718 | finish at 2025-09-10 11:57:58 + [2025-09-10 10:16:52] iteration 10883/ 11920 | consumed samples: 11144192 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815616E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:06.293015 | finish at 2025-09-10 11:53:58 + [2025-09-10 10:16:58] iteration 10884/ 11920 | consumed samples: 11145216 | elapsed time per iteration (ms): 6243.9 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824983E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:47:48.685650 | finish at 2025-09-10 12:04:47 + 
[2025-09-10 10:17:04] iteration 10885/ 11920 | consumed samples: 11146240 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799818E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:58.533826 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:17:10] iteration 10886/ 11920 | consumed samples: 11147264 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818906E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:50.220250 | finish at 2025-09-10 11:54:00 + [2025-09-10 10:17:16] iteration 10887/ 11920 | consumed samples: 11148288 | elapsed time per iteration (ms): 5932.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815719E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:42:07.795325 | finish at 2025-09-10 11:59:23 + [2025-09-10 10:17:21] iteration 10888/ 11920 | consumed samples: 11149312 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820169E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:44.849173 | finish at 2025-09-10 11:54:06 + [2025-09-10 10:17:27] iteration 10889/ 11920 | consumed samples: 11150336 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827704E+00 | loss scale: 
1.0 | grad norm: 0.179 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:37.714556 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:17:33] iteration 10890/ 11920 | consumed samples: 11151360 | elapsed time per iteration (ms): 5936.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817269E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:54.233840 | finish at 2025-09-10 11:59:27 + [2025-09-10 10:17:38] iteration 10891/ 11920 | consumed samples: 11152384 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812201E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:33.288272 | finish at 2025-09-10 11:54:12 + [2025-09-10 10:17:44] iteration 10892/ 11920 | consumed samples: 11153408 | elapsed time per iteration (ms): 5640.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803001E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:38.444610 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:17:50] iteration 10893/ 11920 | consumed samples: 11154432 | elapsed time per iteration (ms): 5638.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818058E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:30.495353 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:17:56] iteration 10894/ 11920 | consumed samples: 11155456 | elapsed time per 
iteration (ms): 5953.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819939E+00 | loss scale: 1.0 | grad norm: 0.286 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:41:47.790301 | finish at 2025-09-10 11:59:43 + [2025-09-10 10:18:01] iteration 10895/ 11920 | consumed samples: 11156480 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813788E+00 | loss scale: 1.0 | grad norm: 0.302 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:10.986187 | finish at 2025-09-10 11:54:12 + [2025-09-10 10:18:07] iteration 10896/ 11920 | consumed samples: 11157504 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827262E+00 | loss scale: 1.0 | grad norm: 0.245 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:14.257812 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:18:13] iteration 10897/ 11920 | consumed samples: 11158528 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809509E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:01.130116 | finish at 2025-09-10 11:54:14 + [2025-09-10 10:18:18] iteration 10898/ 11920 | consumed samples: 11159552 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818530E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:35:45.436660 | finish at 2025-09-10 11:54:04 + [2025-09-10 10:18:24] iteration 10899/ 11920 | consumed samples: 11160576 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809728E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:40.812216 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:18:29] iteration 10900/ 11920 | consumed samples: 11161600 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815965E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:47.587638 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:18:35] iteration 10901/ 11920 | consumed samples: 11162624 | elapsed time per iteration (ms): 5650.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802238E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:57.663501 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:18:41] iteration 10902/ 11920 | consumed samples: 11163648 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824144E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:26.149275 | finish at 2025-09-10 11:54:07 + [2025-09-10 10:18:47] iteration 10903/ 11920 | consumed samples: 11164672 | elapsed time per iteration (ms): 5853.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812091E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:12.527596 | finish at 2025-09-10 11:57:59 + [2025-09-10 10:18:52] iteration 10904/ 11920 | consumed samples: 11165696 | elapsed time per iteration (ms): 5868.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809409E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:22.475447 | finish at 2025-09-10 11:58:15 + [2025-09-10 10:18:59] iteration 10905/ 11920 | consumed samples: 11166720 | elapsed time per iteration (ms): 6139.1 | throughput per GPU (TFLOP/s/GPU): 73.5 | MFU 7.44% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792358E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:43:51.216341 | finish at 2025-09-10 12:02:50 + [2025-09-10 10:19:04] iteration 10906/ 11920 | consumed samples: 11167744 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818259E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:55.182394 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:19:10] iteration 10907/ 11920 | consumed samples: 11168768 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802166E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:56.815731 | finish at 2025-09-10 11:54:07 + 
[2025-09-10 10:19:15] iteration 10908/ 11920 | consumed samples: 11169792 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817340E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:45.438470 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:19:21] iteration 10909/ 11920 | consumed samples: 11170816 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815935E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:43.044856 | finish at 2025-09-10 11:54:04 + [2025-09-10 10:19:27] iteration 10910/ 11920 | consumed samples: 11171840 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801157E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:40.633786 | finish at 2025-09-10 11:54:07 + [2025-09-10 10:19:32] iteration 10911/ 11920 | consumed samples: 11172864 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813291E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:29.092475 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:19:38] iteration 10912/ 11920 | consumed samples: 11173888 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822848E+00 | loss scale: 
1.0 | grad norm: 0.148 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:38.065819 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:19:44] iteration 10913/ 11920 | consumed samples: 11174912 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824774E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:24.041759 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:19:49] iteration 10914/ 11920 | consumed samples: 11175936 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807638E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:14.577825 | finish at 2025-09-10 11:54:04 + [2025-09-10 10:19:55] iteration 10915/ 11920 | consumed samples: 11176960 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816277E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:03.894718 | finish at 2025-09-10 11:53:59 + [2025-09-10 10:20:00] iteration 10916/ 11920 | consumed samples: 11177984 | elapsed time per iteration (ms): 5617.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813224E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:00.203934 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:20:06] iteration 10917/ 11920 | consumed samples: 11179008 | elapsed time per 
iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805048E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:01.844152 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:20:12] iteration 10918/ 11920 | consumed samples: 11180032 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797113E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:33:49.250364 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:20:17] iteration 10919/ 11920 | consumed samples: 11181056 | elapsed time per iteration (ms): 5865.3 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803141E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:51.135169 | finish at 2025-09-10 11:58:09 + [2025-09-10 10:20:23] iteration 10920/ 11920 | consumed samples: 11182080 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824492E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:33:48.467560 | finish at 2025-09-10 11:54:12 + [2025-09-10 10:20:29] iteration 10921/ 11920 | consumed samples: 11183104 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820328E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:33:40.700711 | finish at 2025-09-10 11:54:09 + [2025-09-10 10:20:35] iteration 10922/ 11920 | consumed samples: 11184128 | elapsed time per iteration (ms): 5870.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801517E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:38.958721 | finish at 2025-09-10 11:58:14 + [2025-09-10 10:20:40] iteration 10923/ 11920 | consumed samples: 11185152 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810372E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:33:29.251476 | finish at 2025-09-10 11:54:09 + [2025-09-10 10:20:46] iteration 10924/ 11920 | consumed samples: 11186176 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804742E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:33:16.258472 | finish at 2025-09-10 11:54:02 + [2025-09-10 10:20:52] iteration 10925/ 11920 | consumed samples: 11187200 | elapsed time per iteration (ms): 5992.6 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811476E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:39:22.594501 | finish at 2025-09-10 12:00:14 + [2025-09-10 10:20:58] iteration 10926/ 11920 | consumed samples: 11188224 | elapsed time per iteration (ms): 5921.4 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814084E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:05.883993 | finish at 2025-09-10 11:59:04 + [2025-09-10 10:21:03] iteration 10927/ 11920 | consumed samples: 11189248 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817122E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:59.820372 | finish at 2025-09-10 11:54:03 + [2025-09-10 10:21:09] iteration 10928/ 11920 | consumed samples: 11190272 | elapsed time per iteration (ms): 5932.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816095E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:04.544395 | finish at 2025-09-10 11:59:14 + [2025-09-10 10:21:15] iteration 10929/ 11920 | consumed samples: 11191296 | elapsed time per iteration (ms): 5957.5 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818922E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:23.848694 | finish at 2025-09-10 11:59:39 + [2025-09-10 10:21:21] iteration 10930/ 11920 | consumed samples: 11192320 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823836E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:49.579189 | finish at 2025-09-10 11:54:10 + 
[2025-09-10 10:21:27] iteration 10931/ 11920 | consumed samples: 11193344 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803214E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:47.351407 | finish at 2025-09-10 11:54:14 + [2025-09-10 10:21:32] iteration 10932/ 11920 | consumed samples: 11194368 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807688E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:41.644164 | finish at 2025-09-10 11:54:14 + [2025-09-10 10:21:38] iteration 10933/ 11920 | consumed samples: 11195392 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796510E+00 | loss scale: 1.0 | grad norm: 0.124 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:36.069799 | finish at 2025-09-10 11:54:14 + [2025-09-10 10:21:43] iteration 10934/ 11920 | consumed samples: 11196416 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804773E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:24.941071 | finish at 2025-09-10 11:54:08 + [2025-09-10 10:21:49] iteration 10935/ 11920 | consumed samples: 11197440 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811456E+00 | loss scale: 
1.0 | grad norm: 0.136 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:15.167735 | finish at 2025-09-10 11:54:04 + [2025-09-10 10:21:55] iteration 10936/ 11920 | consumed samples: 11198464 | elapsed time per iteration (ms): 5930.6 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821388E+00 | loss scale: 1.0 | grad norm: 0.127 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:37:15.755373 | finish at 2025-09-10 11:59:11 + [2025-09-10 10:22:01] iteration 10937/ 11920 | consumed samples: 11199488 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823297E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:59.929603 | finish at 2025-09-10 11:54:01 + [2025-09-10 10:22:06] iteration 10938/ 11920 | consumed samples: 11200512 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814566E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:58.491272 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:22:12] iteration 10939/ 11920 | consumed samples: 11201536 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815743E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:03.910703 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:22:17] iteration 10940/ 11920 | consumed samples: 11202560 | elapsed time per 
iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815464E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:00.207419 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:22:23] iteration 10941/ 11920 | consumed samples: 11203584 | elapsed time per iteration (ms): 5642.9 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824448E+00 | loss scale: 1.0 | grad norm: 0.595 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:04.359410 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:22:29] iteration 10942/ 11920 | consumed samples: 11204608 | elapsed time per iteration (ms): 6051.0 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843713E+00 | loss scale: 1.0 | grad norm: 0.432 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:38:37.888142 | finish at 2025-09-10 12:01:07 + [2025-09-10 10:22:35] iteration 10943/ 11920 | consumed samples: 11205632 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.910286E+00 | loss scale: 1.0 | grad norm: 6.444 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:29.759698 | finish at 2025-09-10 11:54:05 + [2025-09-10 10:22:40] iteration 10944/ 11920 | consumed samples: 11206656 | elapsed time per iteration (ms): 5654.9 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.975618E+00 | loss scale: 1.0 | grad norm: 5.727 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:31:59.167580 | finish at 2025-09-10 11:54:40 + [2025-09-10 10:22:46] iteration 10945/ 11920 | consumed samples: 11207680 | elapsed time per iteration (ms): 5656.3 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.985825E+00 | loss scale: 1.0 | grad norm: 1.678 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:54.914882 | finish at 2025-09-10 11:54:41 + [2025-09-10 10:22:52] iteration 10946/ 11920 | consumed samples: 11208704 | elapsed time per iteration (ms): 5666.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921870E+00 | loss scale: 1.0 | grad norm: 0.890 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:59.301364 | finish at 2025-09-10 11:54:51 + [2025-09-10 10:22:57] iteration 10947/ 11920 | consumed samples: 11209728 | elapsed time per iteration (ms): 5647.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.876450E+00 | loss scale: 1.0 | grad norm: 0.403 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:35.389867 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:23:03] iteration 10948/ 11920 | consumed samples: 11210752 | elapsed time per iteration (ms): 5879.5 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880061E+00 | loss scale: 1.0 | grad norm: 0.487 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:14.898479 | finish at 2025-09-10 11:58:18 + [2025-09-10 10:23:09] iteration 10949/ 11920 | consumed samples: 11211776 | elapsed time per iteration (ms): 5939.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.151790E+00 | loss scale: 1.0 | grad norm: 6.956 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:36:07.518969 | finish at 2025-09-10 11:59:17 + [2025-09-10 10:23:15] iteration 10950/ 11920 | consumed samples: 11212800 | elapsed time per iteration (ms): 5642.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.993184E+00 | loss scale: 1.0 | grad norm: 1.318 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:13.420346 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:23:21] iteration 10951/ 11920 | consumed samples: 11213824 | elapsed time per iteration (ms): 5648.5 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.066765E+00 | loss scale: 1.0 | grad norm: 2.536 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:13.414487 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:23:26] iteration 10952/ 11920 | consumed samples: 11214848 | elapsed time per iteration (ms): 5658.1 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.963778E+00 | loss scale: 1.0 | grad norm: 0.796 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:17.068388 | finish at 2025-09-10 11:54:43 + [2025-09-10 10:23:32] iteration 10953/ 11920 | consumed samples: 11215872 | elapsed time per iteration (ms): 5956.0 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935160E+00 | loss scale: 1.0 | grad norm: 0.387 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:59.410818 | finish at 2025-09-10 11:59:32 + 
[2025-09-10 10:23:38] iteration 10954/ 11920 | consumed samples: 11216896 | elapsed time per iteration (ms): 5661.6 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916836E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:09.060798 | finish at 2025-09-10 11:54:47 + [2025-09-10 10:23:43] iteration 10955/ 11920 | consumed samples: 11217920 | elapsed time per iteration (ms): 5659.0 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912223E+00 | loss scale: 1.0 | grad norm: 0.618 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:00.910075 | finish at 2025-09-10 11:54:44 + [2025-09-10 10:23:49] iteration 10956/ 11920 | consumed samples: 11218944 | elapsed time per iteration (ms): 5651.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.903821E+00 | loss scale: 1.0 | grad norm: 0.468 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:48.402462 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:23:55] iteration 10957/ 11920 | consumed samples: 11219968 | elapsed time per iteration (ms): 5978.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.905400E+00 | loss scale: 1.0 | grad norm: 0.756 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:56.816454 | finish at 2025-09-10 11:59:52 + [2025-09-10 10:24:01] iteration 10958/ 11920 | consumed samples: 11220992 | elapsed time per iteration (ms): 5649.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898694E+00 | loss scale: 
1.0 | grad norm: 0.509 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:34.725778 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:24:06] iteration 10959/ 11920 | consumed samples: 11222016 | elapsed time per iteration (ms): 5648.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.911405E+00 | loss scale: 1.0 | grad norm: 0.886 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:27.873035 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:24:12] iteration 10960/ 11920 | consumed samples: 11223040 | elapsed time per iteration (ms): 5647.3 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.888961E+00 | loss scale: 1.0 | grad norm: 0.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:21.363831 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:24:18] iteration 10961/ 11920 | consumed samples: 11224064 | elapsed time per iteration (ms): 5659.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.898708E+00 | loss scale: 1.0 | grad norm: 0.351 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:27.677828 | finish at 2025-09-10 11:54:45 + [2025-09-10 10:24:23] iteration 10962/ 11920 | consumed samples: 11225088 | elapsed time per iteration (ms): 5651.2 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.893963E+00 | loss scale: 1.0 | grad norm: 0.490 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:13.858562 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:24:29] iteration 10963/ 11920 | consumed samples: 11226112 | elapsed time per 
iteration (ms): 5652.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.890879E+00 | loss scale: 1.0 | grad norm: 0.426 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:08.925621 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:24:35] iteration 10964/ 11920 | consumed samples: 11227136 | elapsed time per iteration (ms): 5649.4 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880862E+00 | loss scale: 1.0 | grad norm: 0.288 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:00.820698 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:24:40] iteration 10965/ 11920 | consumed samples: 11228160 | elapsed time per iteration (ms): 5651.8 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.908336E+00 | loss scale: 1.0 | grad norm: 0.826 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:57.429531 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:24:46] iteration 10966/ 11920 | consumed samples: 11229184 | elapsed time per iteration (ms): 5659.6 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.995663E+00 | loss scale: 1.0 | grad norm: 1.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:59.226575 | finish at 2025-09-10 11:54:45 + [2025-09-10 10:24:52] iteration 10967/ 11920 | consumed samples: 11230208 | elapsed time per iteration (ms): 5662.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.923946E+00 | loss scale: 1.0 | grad norm: 0.740 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:29:56.168142 | finish at 2025-09-10 11:54:48 + [2025-09-10 10:24:57] iteration 10968/ 11920 | consumed samples: 11231232 | elapsed time per iteration (ms): 5657.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967942E+00 | loss scale: 1.0 | grad norm: 0.975 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:45.935261 | finish at 2025-09-10 11:54:43 + [2025-09-10 10:25:03] iteration 10969/ 11920 | consumed samples: 11232256 | elapsed time per iteration (ms): 5671.2 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.122695E+00 | loss scale: 1.0 | grad norm: 2.501 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:53.319851 | finish at 2025-09-10 11:54:56 + [2025-09-10 10:25:09] iteration 10970/ 11920 | consumed samples: 11233280 | elapsed time per iteration (ms): 5685.6 | throughput per GPU (TFLOP/s/GPU): 79.4 | MFU 8.03% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 3.083799E+00 | loss scale: 1.0 | grad norm: 1.960 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:01.352656 | finish at 2025-09-10 11:55:10 + [2025-09-10 10:25:14] iteration 10971/ 11920 | consumed samples: 11234304 | elapsed time per iteration (ms): 5653.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.967563E+00 | loss scale: 1.0 | grad norm: 0.455 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:25.570697 | finish at 2025-09-10 11:54:40 + [2025-09-10 10:25:20] iteration 10972/ 11920 | consumed samples: 11235328 | elapsed time per iteration (ms): 5681.5 | throughput per GPU (TFLOP/s/GPU): 79.5 | MFU 8.03% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.972351E+00 | loss scale: 1.0 | grad norm: 0.772 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:46.079816 | finish at 2025-09-10 11:55:06 + [2025-09-10 10:25:26] iteration 10973/ 11920 | consumed samples: 11236352 | elapsed time per iteration (ms): 6044.2 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.55% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.960436E+00 | loss scale: 1.0 | grad norm: 0.573 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:35:23.816480 | finish at 2025-09-10 12:00:50 + [2025-09-10 10:25:32] iteration 10974/ 11920 | consumed samples: 11237376 | elapsed time per iteration (ms): 5656.4 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.953953E+00 | loss scale: 1.0 | grad norm: 0.402 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:10.939707 | finish at 2025-09-10 11:54:43 + [2025-09-10 10:25:37] iteration 10975/ 11920 | consumed samples: 11238400 | elapsed time per iteration (ms): 5657.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.935735E+00 | loss scale: 1.0 | grad norm: 0.319 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:06.480145 | finish at 2025-09-10 11:54:44 + [2025-09-10 10:25:43] iteration 10976/ 11920 | consumed samples: 11239424 | elapsed time per iteration (ms): 5663.3 | throughput per GPU (TFLOP/s/GPU): 79.7 | MFU 8.06% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.914296E+00 | loss scale: 1.0 | grad norm: 0.635 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:06.148258 | finish at 2025-09-10 11:54:49 + 
[2025-09-10 10:25:49] iteration 10977/ 11920 | consumed samples: 11240448 | elapsed time per iteration (ms): 5884.2 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.932134E+00 | loss scale: 1.0 | grad norm: 0.783 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:28.798218 | finish at 2025-09-10 11:58:18 + [2025-09-10 10:25:55] iteration 10978/ 11920 | consumed samples: 11241472 | elapsed time per iteration (ms): 5649.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.921516E+00 | loss scale: 1.0 | grad norm: 0.308 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:28:42.032378 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:26:01] iteration 10979/ 11920 | consumed samples: 11242496 | elapsed time per iteration (ms): 5976.9 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.896993E+00 | loss scale: 1.0 | grad norm: 0.443 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:33:44.279821 | finish at 2025-09-10 11:59:45 + [2025-09-10 10:26:06] iteration 10980/ 11920 | consumed samples: 11243520 | elapsed time per iteration (ms): 5644.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.912566E+00 | loss scale: 1.0 | grad norm: 0.473 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:28:25.829568 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:26:12] iteration 10981/ 11920 | consumed samples: 11244544 | elapsed time per iteration (ms): 6022.7 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904803E+00 | loss scale: 
1.0 | grad norm: 0.448 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:15.305740 | finish at 2025-09-10 12:00:27 + [2025-09-10 10:26:18] iteration 10982/ 11920 | consumed samples: 11245568 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.904876E+00 | loss scale: 1.0 | grad norm: 0.620 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:28:07.450170 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:26:24] iteration 10983/ 11920 | consumed samples: 11246592 | elapsed time per iteration (ms): 5978.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.916191E+00 | loss scale: 1.0 | grad norm: 0.756 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:33:21.416983 | finish at 2025-09-10 11:59:45 + [2025-09-10 10:26:29] iteration 10984/ 11920 | consumed samples: 11247616 | elapsed time per iteration (ms): 5672.8 | throughput per GPU (TFLOP/s/GPU): 79.6 | MFU 8.05% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.900467E+00 | loss scale: 1.0 | grad norm: 0.315 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:28:29.727694 | finish at 2025-09-10 11:54:59 + [2025-09-10 10:26:35] iteration 10985/ 11920 | consumed samples: 11248640 | elapsed time per iteration (ms): 5656.5 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.889250E+00 | loss scale: 1.0 | grad norm: 0.269 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:28:08.811767 | finish at 2025-09-10 11:54:44 + [2025-09-10 10:26:41] iteration 10986/ 11920 | consumed samples: 11249664 | elapsed time per 
iteration (ms): 5653.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.880562E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:28:00.566372 | finish at 2025-09-10 11:54:41 + [2025-09-10 10:26:46] iteration 10987/ 11920 | consumed samples: 11250688 | elapsed time per iteration (ms): 5644.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.885113E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:46.028003 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:26:52] iteration 10988/ 11920 | consumed samples: 11251712 | elapsed time per iteration (ms): 5637.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862794E+00 | loss scale: 1.0 | grad norm: 0.412 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:34.390916 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:26:58] iteration 10989/ 11920 | consumed samples: 11252736 | elapsed time per iteration (ms): 5641.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.882376E+00 | loss scale: 1.0 | grad norm: 0.664 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:32.028735 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:27:04] iteration 10990/ 11920 | consumed samples: 11253760 | elapsed time per iteration (ms): 5992.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.870754E+00 | loss scale: 1.0 | grad norm: 0.295 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:32:53.015528 | finish at 2025-09-10 11:59:57 + [2025-09-10 10:27:10] iteration 10991/ 11920 | consumed samples: 11254784 | elapsed time per iteration (ms): 5947.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.887806E+00 | loss scale: 1.0 | grad norm: 0.307 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:32:05.369245 | finish at 2025-09-10 11:59:15 + [2025-09-10 10:27:15] iteration 10992/ 11920 | consumed samples: 11255808 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867756E+00 | loss scale: 1.0 | grad norm: 0.558 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:04.786133 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:27:21] iteration 10993/ 11920 | consumed samples: 11256832 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.862866E+00 | loss scale: 1.0 | grad norm: 0.483 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:06.850800 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:27:27] iteration 10994/ 11920 | consumed samples: 11257856 | elapsed time per iteration (ms): 5864.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.867540E+00 | loss scale: 1.0 | grad norm: 0.292 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:30.398551 | finish at 2025-09-10 11:57:57 + [2025-09-10 10:27:33] iteration 10995/ 11920 | consumed samples: 11258880 | elapsed time per iteration (ms): 5844.4 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.853977E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:30:06.026268 | finish at 2025-09-10 11:57:39 + [2025-09-10 10:27:38] iteration 10996/ 11920 | consumed samples: 11259904 | elapsed time per iteration (ms): 5645.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.866959E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:55.936804 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:27:44] iteration 10997/ 11920 | consumed samples: 11260928 | elapsed time per iteration (ms): 5640.2 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.843194E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:45.930915 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:27:50] iteration 10998/ 11920 | consumed samples: 11261952 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845404E+00 | loss scale: 1.0 | grad norm: 0.301 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:42.210391 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:27:55] iteration 10999/ 11920 | consumed samples: 11262976 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.865084E+00 | loss scale: 1.0 | grad norm: 0.475 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:31.662365 | finish at 2025-09-10 11:54:27 + 
[2025-09-10 10:28:01] iteration 11000/ 11920 | consumed samples: 11264000 | elapsed time per iteration (ms): 6185.6 | throughput per GPU (TFLOP/s/GPU): 73.0 | MFU 7.38% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.860264E+00 | loss scale: 1.0 | grad norm: 0.532 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:34:50.767612 | finish at 2025-09-10 12:02:52 + [2025-09-10 10:28:07] iteration 11001/ 11920 | consumed samples: 11265024 | elapsed time per iteration (ms): 5649.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845747E+00 | loss scale: 1.0 | grad norm: 0.393 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:32.286547 | finish at 2025-09-10 11:54:39 + [2025-09-10 10:28:13] iteration 11002/ 11920 | consumed samples: 11266048 | elapsed time per iteration (ms): 5965.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854229E+00 | loss scale: 1.0 | grad norm: 0.430 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:31:15.990518 | finish at 2025-09-10 11:59:29 + [2025-09-10 10:28:19] iteration 11003/ 11920 | consumed samples: 11267072 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.854286E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:23.726991 | finish at 2025-09-10 11:54:42 + [2025-09-10 10:28:24] iteration 11004/ 11920 | consumed samples: 11268096 | elapsed time per iteration (ms): 5654.0 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837574E+00 | loss scale: 
1.0 | grad norm: 0.221 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:19.029534 | finish at 2025-09-10 11:54:43 + [2025-09-10 10:28:30] iteration 11005/ 11920 | consumed samples: 11269120 | elapsed time per iteration (ms): 5657.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.837093E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:16.769378 | finish at 2025-09-10 11:54:47 + [2025-09-10 10:28:36] iteration 11006/ 11920 | consumed samples: 11270144 | elapsed time per iteration (ms): 5642.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.838052E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:56.902804 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:28:41] iteration 11007/ 11920 | consumed samples: 11271168 | elapsed time per iteration (ms): 5636.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.839621E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:46.449600 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:28:47] iteration 11008/ 11920 | consumed samples: 11272192 | elapsed time per iteration (ms): 5632.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849080E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:36.954529 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:28:52] iteration 11009/ 11920 | consumed samples: 11273216 | elapsed time per 
iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841067E+00 | loss scale: 1.0 | grad norm: 0.431 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:26.492042 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:28:58] iteration 11010/ 11920 | consumed samples: 11274240 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846115E+00 | loss scale: 1.0 | grad norm: 0.641 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:25.987816 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:29:04] iteration 11011/ 11920 | consumed samples: 11275264 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.841212E+00 | loss scale: 1.0 | grad norm: 0.291 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:19.096138 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:29:09] iteration 11012/ 11920 | consumed samples: 11276288 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.849231E+00 | loss scale: 1.0 | grad norm: 0.321 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:13.528215 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:29:15] iteration 11013/ 11920 | consumed samples: 11277312 | elapsed time per iteration (ms): 5632.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.847740E+00 | loss scale: 1.0 | grad norm: 0.400 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:25:08.896496 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:29:21] iteration 11014/ 11920 | consumed samples: 11278336 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.845323E+00 | loss scale: 1.0 | grad norm: 0.271 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:00.341177 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:29:26] iteration 11015/ 11920 | consumed samples: 11279360 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827629E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:00.666233 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:29:32] iteration 11016/ 11920 | consumed samples: 11280384 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833701E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:54.421911 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:29:38] iteration 11017/ 11920 | consumed samples: 11281408 | elapsed time per iteration (ms): 5637.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.846952E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:50.803344 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:29:43] iteration 11018/ 11920 | consumed samples: 11282432 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824438E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:37.719889 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:29:49] iteration 11019/ 11920 | consumed samples: 11283456 | elapsed time per iteration (ms): 5635.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812875E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:37.315866 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:29:54] iteration 11020/ 11920 | consumed samples: 11284480 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828960E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:26.316462 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:30:00] iteration 11021/ 11920 | consumed samples: 11285504 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824523E+00 | loss scale: 1.0 | grad norm: 0.117 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:18.702663 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:30:06] iteration 11022/ 11920 | consumed samples: 11286528 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.831147E+00 | loss scale: 1.0 | grad norm: 0.120 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:11.808158 | finish at 2025-09-10 11:54:18 + 
[2025-09-10 10:30:11] iteration 11023/ 11920 | consumed samples: 11287552 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828027E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:11.819925 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:30:17] iteration 11024/ 11920 | consumed samples: 11288576 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827327E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:04.725769 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:30:23] iteration 11025/ 11920 | consumed samples: 11289600 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822813E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:59.891633 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:30:28] iteration 11026/ 11920 | consumed samples: 11290624 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814152E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:24:06.976132 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:30:34] iteration 11027/ 11920 | consumed samples: 11291648 | elapsed time per iteration (ms): 5867.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833325E+00 | loss scale: 
1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:19.191278 | finish at 2025-09-10 11:57:53 + [2025-09-10 10:30:40] iteration 11028/ 11920 | consumed samples: 11292672 | elapsed time per iteration (ms): 5646.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814932E+00 | loss scale: 1.0 | grad norm: 0.142 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:56.476275 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:30:45] iteration 11029/ 11920 | consumed samples: 11293696 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815094E+00 | loss scale: 1.0 | grad norm: 0.107 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:42.654809 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:30:51] iteration 11030/ 11920 | consumed samples: 11294720 | elapsed time per iteration (ms): 5846.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826317E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:26:43.615582 | finish at 2025-09-10 11:57:35 + [2025-09-10 10:30:57] iteration 11031/ 11920 | consumed samples: 11295744 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827022E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:21.875953 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:31:02] iteration 11032/ 11920 | consumed samples: 11296768 | elapsed time per 
iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820889E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:17.375450 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:31:08] iteration 11033/ 11920 | consumed samples: 11297792 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808756E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:13.700556 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:31:14] iteration 11034/ 11920 | consumed samples: 11298816 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805261E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:06.505822 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:31:20] iteration 11035/ 11920 | consumed samples: 11299840 | elapsed time per iteration (ms): 5937.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808588E+00 | loss scale: 1.0 | grad norm: 0.118 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:27:34.388723 | finish at 2025-09-10 11:58:54 + [2025-09-10 10:31:25] iteration 11036/ 11920 | consumed samples: 11300864 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816061E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:23:03.662800 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:31:31] iteration 11037/ 11920 | consumed samples: 11301888 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820481E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:53.246286 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:31:37] iteration 11038/ 11920 | consumed samples: 11302912 | elapsed time per iteration (ms): 5639.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823510E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:53.669653 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:31:42] iteration 11039/ 11920 | consumed samples: 11303936 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829969E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:46.482107 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:31:48] iteration 11040/ 11920 | consumed samples: 11304960 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819065E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:31.577339 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:31:54] iteration 11041/ 11920 | consumed samples: 11305984 | elapsed time per iteration (ms): 5838.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816141E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:32.342882 | finish at 2025-09-10 11:57:26 + [2025-09-10 10:31:59] iteration 11042/ 11920 | consumed samples: 11307008 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813071E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:18.049367 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:32:05] iteration 11043/ 11920 | consumed samples: 11308032 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829210E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:14.750699 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:32:11] iteration 11044/ 11920 | consumed samples: 11309056 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811277E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:11.592507 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:32:16] iteration 11045/ 11920 | consumed samples: 11310080 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820867E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:05.662011 | finish at 2025-09-10 11:54:22 + 
[2025-09-10 10:32:22] iteration 11046/ 11920 | consumed samples: 11311104 | elapsed time per iteration (ms): 5641.0 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813075E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:10.226527 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:32:28] iteration 11047/ 11920 | consumed samples: 11312128 | elapsed time per iteration (ms): 5851.2 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819769E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:08.122459 | finish at 2025-09-10 11:57:36 + [2025-09-10 10:32:34] iteration 11048/ 11920 | consumed samples: 11313152 | elapsed time per iteration (ms): 5849.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820860E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:01.080582 | finish at 2025-09-10 11:57:35 + [2025-09-10 10:32:39] iteration 11049/ 11920 | consumed samples: 11314176 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.835018E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:46.888232 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:32:45] iteration 11050/ 11920 | consumed samples: 11315200 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.829049E+00 | loss scale: 
1.0 | grad norm: 0.150 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:33.217542 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:32:50] iteration 11051/ 11920 | consumed samples: 11316224 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824234E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:29.074118 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:32:56] iteration 11052/ 11920 | consumed samples: 11317248 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819508E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:22.866711 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:33:02] iteration 11053/ 11920 | consumed samples: 11318272 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822975E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:14.233674 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:33:08] iteration 11054/ 11920 | consumed samples: 11319296 | elapsed time per iteration (ms): 6214.8 | throughput per GPU (TFLOP/s/GPU): 72.6 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825049E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:42.030530 | finish at 2025-09-10 12:02:50 + [2025-09-10 10:33:14] iteration 11055/ 11920 | consumed samples: 11320320 | elapsed time per 
iteration (ms): 5923.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814225E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:23.683603 | finish at 2025-09-10 11:58:38 + [2025-09-10 10:33:19] iteration 11056/ 11920 | consumed samples: 11321344 | elapsed time per iteration (ms): 5633.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826594E+00 | loss scale: 1.0 | grad norm: 0.119 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:07.030769 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:33:25] iteration 11057/ 11920 | consumed samples: 11322368 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825450E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:53.853822 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:33:31] iteration 11058/ 11920 | consumed samples: 11323392 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825118E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:47.405920 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:33:36] iteration 11059/ 11920 | consumed samples: 11324416 | elapsed time per iteration (ms): 5635.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816563E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:20:51.828393 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:33:42] iteration 11060/ 11920 | consumed samples: 11325440 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827453E+00 | loss scale: 1.0 | grad norm: 0.116 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:40.847483 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:33:48] iteration 11061/ 11920 | consumed samples: 11326464 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813269E+00 | loss scale: 1.0 | grad norm: 0.128 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:35.582523 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:33:53] iteration 11062/ 11920 | consumed samples: 11327488 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817416E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:24.170616 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:33:59] iteration 11063/ 11920 | consumed samples: 11328512 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818431E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:16.055278 | finish at 2025-09-10 11:54:15 + [2025-09-10 10:34:04] iteration 11064/ 11920 | consumed samples: 11329536 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810751E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:12.514841 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:34:10] iteration 11065/ 11920 | consumed samples: 11330560 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814104E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:11.920245 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:34:16] iteration 11066/ 11920 | consumed samples: 11331584 | elapsed time per iteration (ms): 5869.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821116E+00 | loss scale: 1.0 | grad norm: 0.129 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:32.617477 | finish at 2025-09-10 11:57:49 + [2025-09-10 10:34:22] iteration 11067/ 11920 | consumed samples: 11332608 | elapsed time per iteration (ms): 6003.1 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813429E+00 | loss scale: 1.0 | grad norm: 0.114 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:25:20.662127 | finish at 2025-09-10 11:59:43 + [2025-09-10 10:34:28] iteration 11068/ 11920 | consumed samples: 11333632 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819304E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:53.203042 | finish at 2025-09-10 11:54:21 + 
[2025-09-10 10:34:34] iteration 11069/ 11920 | consumed samples: 11334656 | elapsed time per iteration (ms): 6300.0 | throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.833709E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:29:21.263641 | finish at 2025-09-10 12:03:55 + [2025-09-10 10:34:40] iteration 11070/ 11920 | consumed samples: 11335680 | elapsed time per iteration (ms): 5859.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.825387E+00 | loss scale: 1.0 | grad norm: 0.247 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:23:00.428827 | finish at 2025-09-10 11:57:40 + [2025-09-10 10:34:45] iteration 11071/ 11920 | consumed samples: 11336704 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.832448E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:39.473764 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:34:51] iteration 11072/ 11920 | consumed samples: 11337728 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815785E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:25.534473 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:34:57] iteration 11073/ 11920 | consumed samples: 11338752 | elapsed time per iteration (ms): 5841.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823626E+00 | loss scale: 
1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:27.785473 | finish at 2025-09-10 11:57:25 + [2025-09-10 10:35:02] iteration 11074/ 11920 | consumed samples: 11339776 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.823612E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:17.792722 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:35:08] iteration 11075/ 11920 | consumed samples: 11340800 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802942E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:09.743230 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:35:14] iteration 11076/ 11920 | consumed samples: 11341824 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810432E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:02.163908 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:35:19] iteration 11077/ 11920 | consumed samples: 11342848 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818822E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:57.457912 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:35:25] iteration 11078/ 11920 | consumed samples: 11343872 | elapsed time per 
iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805068E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:03.633929 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:35:31] iteration 11079/ 11920 | consumed samples: 11344896 | elapsed time per iteration (ms): 5637.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812433E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:00.992571 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:35:36] iteration 11080/ 11920 | consumed samples: 11345920 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819229E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:49.566393 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:35:42] iteration 11081/ 11920 | consumed samples: 11346944 | elapsed time per iteration (ms): 5884.1 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821076E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:22:16.747465 | finish at 2025-09-10 11:57:59 + [2025-09-10 10:35:48] iteration 11082/ 11920 | consumed samples: 11347968 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807152E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:18:39.896486 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:35:54] iteration 11083/ 11920 | consumed samples: 11348992 | elapsed time per iteration (ms): 5822.8 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818179E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:13.677547 | finish at 2025-09-10 11:57:07 + [2025-09-10 10:35:59] iteration 11084/ 11920 | consumed samples: 11350016 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820140E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:19.474154 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:36:05] iteration 11085/ 11920 | consumed samples: 11351040 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796552E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:16.136614 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:36:10] iteration 11086/ 11920 | consumed samples: 11352064 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812772E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:05.398107 | finish at 2025-09-10 11:54:16 + [2025-09-10 10:36:16] iteration 11087/ 11920 | consumed samples: 11353088 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811701E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:08.636214 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:36:22] iteration 11088/ 11920 | consumed samples: 11354112 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807279E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:04.779785 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:36:27] iteration 11089/ 11920 | consumed samples: 11355136 | elapsed time per iteration (ms): 5636.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817594E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:04.170143 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:36:33] iteration 11090/ 11920 | consumed samples: 11356160 | elapsed time per iteration (ms): 5837.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799441E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:45.330091 | finish at 2025-09-10 11:57:19 + [2025-09-10 10:36:39] iteration 11091/ 11920 | consumed samples: 11357184 | elapsed time per iteration (ms): 5637.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815270E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:53.777886 | finish at 2025-09-10 11:54:33 + 
[2025-09-10 10:36:44] iteration 11092/ 11920 | consumed samples: 11358208 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812267E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:47.103433 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:36:50] iteration 11093/ 11920 | consumed samples: 11359232 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810059E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:28.966711 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:36:56] iteration 11094/ 11920 | consumed samples: 11360256 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813352E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:21.098804 | finish at 2025-09-10 11:54:17 + [2025-09-10 10:37:01] iteration 11095/ 11920 | consumed samples: 11361280 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814390E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:16.772329 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:37:07] iteration 11096/ 11920 | consumed samples: 11362304 | elapsed time per iteration (ms): 5638.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819508E+00 | loss scale: 
1.0 | grad norm: 0.149 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:26.423773 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:37:13] iteration 11097/ 11920 | consumed samples: 11363328 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813851E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:17:08.994925 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:37:18] iteration 11098/ 11920 | consumed samples: 11364352 | elapsed time per iteration (ms): 5892.3 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816043E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:43.500103 | finish at 2025-09-10 11:58:02 + [2025-09-10 10:37:24] iteration 11099/ 11920 | consumed samples: 11365376 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811840E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:54.653718 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:37:30] iteration 11100/ 11920 | consumed samples: 11366400 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816276E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:48.519750 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:37:35] iteration 11101/ 11920 | consumed samples: 11367424 | elapsed time per 
iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819616E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:50.887497 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:37:41] iteration 11102/ 11920 | consumed samples: 11368448 | elapsed time per iteration (ms): 5915.5 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819360E+00 | loss scale: 1.0 | grad norm: 0.125 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:38.860607 | finish at 2025-09-10 11:58:20 + [2025-09-10 10:37:47] iteration 11103/ 11920 | consumed samples: 11369472 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812260E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:37.196160 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:37:53] iteration 11104/ 11920 | consumed samples: 11370496 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809070E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:29.814594 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:37:58] iteration 11105/ 11920 | consumed samples: 11371520 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801749E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:16:27.052025 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:38:04] iteration 11106/ 11920 | consumed samples: 11372544 | elapsed time per iteration (ms): 5956.2 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806345E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:20:48.327687 | finish at 2025-09-10 11:58:52 + [2025-09-10 10:38:10] iteration 11107/ 11920 | consumed samples: 11373568 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808166E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:09.071149 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:38:15] iteration 11108/ 11920 | consumed samples: 11374592 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801589E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:07.894742 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:38:21] iteration 11109/ 11920 | consumed samples: 11375616 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827433E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:57.168486 | finish at 2025-09-10 11:54:18 + [2025-09-10 10:38:27] iteration 11110/ 11920 | consumed samples: 11376640 | elapsed time per iteration (ms): 5851.5 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815967E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:59.729705 | finish at 2025-09-10 11:57:27 + [2025-09-10 10:38:32] iteration 11111/ 11920 | consumed samples: 11377664 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804026E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:51.602905 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:38:38] iteration 11112/ 11920 | consumed samples: 11378688 | elapsed time per iteration (ms): 5641.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806661E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:58.469543 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:38:44] iteration 11113/ 11920 | consumed samples: 11379712 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795898E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:36.512416 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:38:49] iteration 11114/ 11920 | consumed samples: 11380736 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810614E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:32.375633 | finish at 2025-09-10 11:54:22 + 
[2025-09-10 10:38:55] iteration 11115/ 11920 | consumed samples: 11381760 | elapsed time per iteration (ms): 5947.1 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804610E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:47.396438 | finish at 2025-09-10 11:58:43 + [2025-09-10 10:39:01] iteration 11116/ 11920 | consumed samples: 11382784 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817682E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:19.353816 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:39:07] iteration 11117/ 11920 | consumed samples: 11383808 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797180E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:13.396543 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:39:12] iteration 11118/ 11920 | consumed samples: 11384832 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801898E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:11.144069 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:39:18] iteration 11119/ 11920 | consumed samples: 11385856 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793724E+00 | loss scale: 
1.0 | grad norm: 0.218 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:05.643524 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:39:23] iteration 11120/ 11920 | consumed samples: 11386880 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814806E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:59.533844 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:39:29] iteration 11121/ 11920 | consumed samples: 11387904 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814351E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:56.601141 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:39:35] iteration 11122/ 11920 | consumed samples: 11388928 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807822E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:51.335607 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:39:40] iteration 11123/ 11920 | consumed samples: 11389952 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815440E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:48.662171 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:39:46] iteration 11124/ 11920 | consumed samples: 11390976 | elapsed time per 
iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806756E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:35.018042 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:39:52] iteration 11125/ 11920 | consumed samples: 11392000 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799426E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:30.410024 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:39:57] iteration 11126/ 11920 | consumed samples: 11393024 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828686E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:34.395388 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:40:03] iteration 11127/ 11920 | consumed samples: 11394048 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809090E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:24.071105 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:40:08] iteration 11128/ 11920 | consumed samples: 11395072 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809239E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:14:15.068733 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:40:14] iteration 11129/ 11920 | consumed samples: 11396096 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803163E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:05.568141 | finish at 2025-09-10 11:54:20 + [2025-09-10 10:40:20] iteration 11130/ 11920 | consumed samples: 11397120 | elapsed time per iteration (ms): 5992.2 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807056E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:53.849409 | finish at 2025-09-10 11:59:14 + [2025-09-10 10:40:26] iteration 11131/ 11920 | consumed samples: 11398144 | elapsed time per iteration (ms): 6211.9 | throughput per GPU (TFLOP/s/GPU): 72.7 | MFU 7.35% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818229E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:21:41.203393 | finish at 2025-09-10 12:02:07 + [2025-09-10 10:40:32] iteration 11132/ 11920 | consumed samples: 11399168 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802300E+00 | loss scale: 1.0 | grad norm: 0.123 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:56.032968 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:40:37] iteration 11133/ 11920 | consumed samples: 11400192 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806564E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:57.491787 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:40:43] iteration 11134/ 11920 | consumed samples: 11401216 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791283E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:43.076933 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:40:49] iteration 11135/ 11920 | consumed samples: 11402240 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805730E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:34.307404 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:40:55] iteration 11136/ 11920 | consumed samples: 11403264 | elapsed time per iteration (ms): 5977.2 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801229E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:18:06.098095 | finish at 2025-09-10 11:59:01 + [2025-09-10 10:41:00] iteration 11137/ 11920 | consumed samples: 11404288 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802127E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:24.915817 | finish at 2025-09-10 11:54:25 + 
[2025-09-10 10:41:06] iteration 11138/ 11920 | consumed samples: 11405312 | elapsed time per iteration (ms): 5869.7 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803037E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:30.112636 | finish at 2025-09-10 11:57:36 + [2025-09-10 10:41:12] iteration 11139/ 11920 | consumed samples: 11406336 | elapsed time per iteration (ms): 5628.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797824E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:15.602084 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:41:18] iteration 11140/ 11920 | consumed samples: 11407360 | elapsed time per iteration (ms): 5858.9 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809618E+00 | loss scale: 1.0 | grad norm: 0.137 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:09.965115 | finish at 2025-09-10 11:57:28 + [2025-09-10 10:41:23] iteration 11141/ 11920 | consumed samples: 11408384 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817332E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:00.262323 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:41:29] iteration 11142/ 11920 | consumed samples: 11409408 | elapsed time per iteration (ms): 5842.0 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800576E+00 | loss scale: 
1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:15:45.112733 | finish at 2025-09-10 11:57:14 + [2025-09-10 10:41:35] iteration 11143/ 11920 | consumed samples: 11410432 | elapsed time per iteration (ms): 5637.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789886E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:59.952214 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:41:40] iteration 11144/ 11920 | consumed samples: 11411456 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810457E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:55.624916 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:41:46] iteration 11145/ 11920 | consumed samples: 11412480 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.819998E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:38.646619 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:41:52] iteration 11146/ 11920 | consumed samples: 11413504 | elapsed time per iteration (ms): 5616.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820205E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:26.839128 | finish at 2025-09-10 11:54:19 + [2025-09-10 10:41:57] iteration 11147/ 11920 | consumed samples: 11414528 | elapsed time per 
iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804468E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:23.948817 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:42:03] iteration 11148/ 11920 | consumed samples: 11415552 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814524E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:21.578600 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:42:09] iteration 11149/ 11920 | consumed samples: 11416576 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792783E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:12.197865 | finish at 2025-09-10 11:54:21 + [2025-09-10 10:42:14] iteration 11150/ 11920 | consumed samples: 11417600 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816891E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:17.855291 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:42:20] iteration 11151/ 11920 | consumed samples: 11418624 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805372E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:12:03.286082 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:42:26] iteration 11152/ 11920 | consumed samples: 11419648 | elapsed time per iteration (ms): 5964.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805927E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:16:20.423035 | finish at 2025-09-10 11:58:46 + [2025-09-10 10:42:31] iteration 11153/ 11920 | consumed samples: 11420672 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806127E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:53.811587 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:42:37] iteration 11154/ 11920 | consumed samples: 11421696 | elapsed time per iteration (ms): 5845.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801736E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:37.972229 | finish at 2025-09-10 11:57:15 + [2025-09-10 10:42:43] iteration 11155/ 11920 | consumed samples: 11422720 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813186E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:45.199689 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:42:49] iteration 11156/ 11920 | consumed samples: 11423744 | elapsed time per iteration (ms): 5845.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802827E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:26.029937 | finish at 2025-09-10 11:57:15 + [2025-09-10 10:42:54] iteration 11157/ 11920 | consumed samples: 11424768 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802588E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:34.466902 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:43:00] iteration 11158/ 11920 | consumed samples: 11425792 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807668E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:22.419384 | finish at 2025-09-10 11:54:22 + [2025-09-10 10:43:06] iteration 11159/ 11920 | consumed samples: 11426816 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803355E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:17.886215 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:43:11] iteration 11160/ 11920 | consumed samples: 11427840 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797511E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:11.622105 | finish at 2025-09-10 11:54:23 + 
[2025-09-10 10:43:17] iteration 11161/ 11920 | consumed samples: 11428864 | elapsed time per iteration (ms): 5837.8 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815187E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:50.880629 | finish at 2025-09-10 11:57:08 + [2025-09-10 10:43:23] iteration 11162/ 11920 | consumed samples: 11429888 | elapsed time per iteration (ms): 6316.4 | throughput per GPU (TFLOP/s/GPU): 71.5 | MFU 7.23% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806675E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:19:47.826359 | finish at 2025-09-10 12:03:11 + [2025-09-10 10:43:29] iteration 11163/ 11920 | consumed samples: 11430912 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818222E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:58.964065 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:43:35] iteration 11164/ 11920 | consumed samples: 11431936 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809200E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:58.964828 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:43:40] iteration 11165/ 11920 | consumed samples: 11432960 | elapsed time per iteration (ms): 5633.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805484E+00 | loss scale: 
1.0 | grad norm: 0.149 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:53.261974 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:43:46] iteration 11166/ 11920 | consumed samples: 11433984 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793559E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:43.840092 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:43:52] iteration 11167/ 11920 | consumed samples: 11435008 | elapsed time per iteration (ms): 5854.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799979E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:28.572111 | finish at 2025-09-10 11:57:20 + [2025-09-10 10:43:58] iteration 11168/ 11920 | consumed samples: 11436032 | elapsed time per iteration (ms): 5922.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796175E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:14:13.544037 | finish at 2025-09-10 11:58:11 + [2025-09-10 10:44:03] iteration 11169/ 11920 | consumed samples: 11437056 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817028E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:24.000960 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:44:09] iteration 11170/ 11920 | consumed samples: 11438080 | elapsed time per 
iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809805E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:15.015292 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:44:15] iteration 11171/ 11920 | consumed samples: 11439104 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811505E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:10.472439 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:44:20] iteration 11172/ 11920 | consumed samples: 11440128 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798904E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:02.782449 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:44:26] iteration 11173/ 11920 | consumed samples: 11441152 | elapsed time per iteration (ms): 5636.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796645E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:10.137354 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:44:32] iteration 11174/ 11920 | consumed samples: 11442176 | elapsed time per iteration (ms): 5928.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809894E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:13:42.504258 | finish at 2025-09-10 11:58:14 + [2025-09-10 10:44:37] iteration 11175/ 11920 | consumed samples: 11443200 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806033E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:00.706993 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:44:43] iteration 11176/ 11920 | consumed samples: 11444224 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813527E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:09:55.720699 | finish at 2025-09-10 11:54:39 + [2025-09-10 10:44:49] iteration 11177/ 11920 | consumed samples: 11445248 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812240E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:09:43.790339 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:44:54] iteration 11178/ 11920 | consumed samples: 11446272 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802276E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:09:40.249017 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:45:00] iteration 11179/ 11920 | consumed samples: 11447296 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800974E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:09:25.758530 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:45:06] iteration 11180/ 11920 | consumed samples: 11448320 | elapsed time per iteration (ms): 5870.1 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806988E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:23.851609 | finish at 2025-09-10 11:57:30 + [2025-09-10 10:45:11] iteration 11181/ 11920 | consumed samples: 11449344 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808857E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:09:12.728689 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:45:17] iteration 11182/ 11920 | consumed samples: 11450368 | elapsed time per iteration (ms): 5966.1 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805297E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:22.994628 | finish at 2025-09-10 11:58:40 + [2025-09-10 10:45:23] iteration 11183/ 11920 | consumed samples: 11451392 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795232E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:58.319494 | finish at 2025-09-10 11:54:21 + 
[2025-09-10 10:45:29] iteration 11184/ 11920 | consumed samples: 11452416 | elapsed time per iteration (ms): 5996.5 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814279E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:13:33.420502 | finish at 2025-09-10 11:59:02 + [2025-09-10 10:45:35] iteration 11185/ 11920 | consumed samples: 11453440 | elapsed time per iteration (ms): 5839.2 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806964E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:31.838826 | finish at 2025-09-10 11:57:07 + [2025-09-10 10:45:40] iteration 11186/ 11920 | consumed samples: 11454464 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798484E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:50.400243 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:45:46] iteration 11187/ 11920 | consumed samples: 11455488 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809476E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:40.651960 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:45:52] iteration 11188/ 11920 | consumed samples: 11456512 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806685E+00 | loss scale: 
1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:32.879869 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:45:57] iteration 11189/ 11920 | consumed samples: 11457536 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814441E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:27.321482 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:46:03] iteration 11190/ 11920 | consumed samples: 11458560 | elapsed time per iteration (ms): 5618.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807827E+00 | loss scale: 1.0 | grad norm: 0.243 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:21.321900 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:46:09] iteration 11191/ 11920 | consumed samples: 11459584 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826449E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:20.110357 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:46:14] iteration 11192/ 11920 | consumed samples: 11460608 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806434E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:18.329222 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:46:20] iteration 11193/ 11920 | consumed samples: 11461632 | elapsed time per 
iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818030E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:10.454328 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:46:25] iteration 11194/ 11920 | consumed samples: 11462656 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817681E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:00.809861 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:46:31] iteration 11195/ 11920 | consumed samples: 11463680 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803497E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:53.860013 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:46:37] iteration 11196/ 11920 | consumed samples: 11464704 | elapsed time per iteration (ms): 5955.9 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812869E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:11:52.073120 | finish at 2025-09-10 11:58:29 + [2025-09-10 10:46:43] iteration 11197/ 11920 | consumed samples: 11465728 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806922E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:07:48.703054 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:46:48] iteration 11198/ 11920 | consumed samples: 11466752 | elapsed time per iteration (ms): 5870.5 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802161E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:10:38.499342 | finish at 2025-09-10 11:57:27 + [2025-09-10 10:46:54] iteration 11199/ 11920 | consumed samples: 11467776 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799775E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:32.186488 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:47:00] iteration 11200/ 11920 | consumed samples: 11468800 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812413E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:24.377575 | finish at 2025-09-10 11:54:24 + [2025-09-10 10:47:05] iteration 11201/ 11920 | consumed samples: 11469824 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791696E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:24.505796 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:47:11] iteration 11202/ 11920 | consumed samples: 11470848 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792373E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:13.984394 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:47:17] iteration 11203/ 11920 | consumed samples: 11471872 | elapsed time per iteration (ms): 6055.0 | throughput per GPU (TFLOP/s/GPU): 74.6 | MFU 7.54% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806855E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:12:21.444279 | finish at 2025-09-10 11:59:38 + [2025-09-10 10:47:23] iteration 11204/ 11920 | consumed samples: 11472896 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808802E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:05.343791 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:47:28] iteration 11205/ 11920 | consumed samples: 11473920 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796062E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:00.832410 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:47:34] iteration 11206/ 11920 | consumed samples: 11474944 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800367E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:56.901474 | finish at 2025-09-10 11:54:31 + 
[2025-09-10 10:47:40] iteration 11207/ 11920 | consumed samples: 11475968 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811950E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:55.849548 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:47:45] iteration 11208/ 11920 | consumed samples: 11476992 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800953E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:46.073864 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:47:51] iteration 11209/ 11920 | consumed samples: 11478016 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805083E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:35.800933 | finish at 2025-09-10 11:54:27 + [2025-09-10 10:47:56] iteration 11210/ 11920 | consumed samples: 11479040 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798018E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:31.658072 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:48:02] iteration 11211/ 11920 | consumed samples: 11480064 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808334E+00 | loss scale: 
1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:23.360811 | finish at 2025-09-10 11:54:25 + [2025-09-10 10:48:08] iteration 11212/ 11920 | consumed samples: 11481088 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806931E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:25.130754 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:48:13] iteration 11213/ 11920 | consumed samples: 11482112 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793802E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:12.768831 | finish at 2025-09-10 11:54:26 + [2025-09-10 10:48:19] iteration 11214/ 11920 | consumed samples: 11483136 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798682E+00 | loss scale: 1.0 | grad norm: 0.139 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:09.611371 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:48:25] iteration 11215/ 11920 | consumed samples: 11484160 | elapsed time per iteration (ms): 5908.8 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809985E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:09:25.671648 | finish at 2025-09-10 11:57:50 + [2025-09-10 10:48:30] iteration 11216/ 11920 | consumed samples: 11485184 | elapsed time per 
iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806441E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:58.295853 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:48:36] iteration 11217/ 11920 | consumed samples: 11486208 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.826965E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:59.180999 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:48:42] iteration 11218/ 11920 | consumed samples: 11487232 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801729E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:47.301581 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:48:47] iteration 11219/ 11920 | consumed samples: 11488256 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799130E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:41.403379 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:48:53] iteration 11220/ 11920 | consumed samples: 11489280 | elapsed time per iteration (ms): 5947.0 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798417E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:09:22.869477 | finish at 2025-09-10 11:58:16 + [2025-09-10 10:48:59] iteration 11221/ 11920 | consumed samples: 11490304 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800866E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:31.733344 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:49:04] iteration 11222/ 11920 | consumed samples: 11491328 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810348E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:26.015187 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:49:10] iteration 11223/ 11920 | consumed samples: 11492352 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.820199E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:17.424417 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:49:16] iteration 11224/ 11920 | consumed samples: 11493376 | elapsed time per iteration (ms): 5613.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.779406E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:06.962563 | finish at 2025-09-10 11:54:23 + [2025-09-10 10:49:22] iteration 11225/ 11920 | consumed samples: 11494400 | elapsed time per iteration (ms): 5871.5 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817588E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:00.686380 | finish at 2025-09-10 11:57:22 + [2025-09-10 10:49:27] iteration 11226/ 11920 | consumed samples: 11495424 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816084E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:02.308325 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:49:33] iteration 11227/ 11920 | consumed samples: 11496448 | elapsed time per iteration (ms): 5860.6 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815117E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:41.406371 | finish at 2025-09-10 11:57:14 + [2025-09-10 10:49:39] iteration 11228/ 11920 | consumed samples: 11497472 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813030E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:58.054242 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:49:44] iteration 11229/ 11920 | consumed samples: 11498496 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816763E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:43.871823 | finish at 2025-09-10 11:54:28 + 
[2025-09-10 10:49:50] iteration 11230/ 11920 | consumed samples: 11499520 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809212E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:37.827394 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:49:56] iteration 11231/ 11920 | consumed samples: 11500544 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808243E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:38.053902 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:50:02] iteration 11232/ 11920 | consumed samples: 11501568 | elapsed time per iteration (ms): 5956.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789825E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:18.016594 | finish at 2025-09-10 11:58:20 + [2025-09-10 10:50:07] iteration 11233/ 11920 | consumed samples: 11502592 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798716E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:26.062565 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:50:13] iteration 11234/ 11920 | consumed samples: 11503616 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801319E+00 | loss scale: 
1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:27.934276 | finish at 2025-09-10 11:54:41 + [2025-09-10 10:50:18] iteration 11235/ 11920 | consumed samples: 11504640 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806830E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:13.371118 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:50:24] iteration 11236/ 11920 | consumed samples: 11505664 | elapsed time per iteration (ms): 5631.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797419E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:11.871966 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:50:30] iteration 11237/ 11920 | consumed samples: 11506688 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808974E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:59.429308 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:50:35] iteration 11238/ 11920 | consumed samples: 11507712 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791753E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:57.963657 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:50:41] iteration 11239/ 11920 | consumed samples: 11508736 | elapsed time per 
iteration (ms): 5915.4 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814719E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:08.354622 | finish at 2025-09-10 11:57:50 + [2025-09-10 10:50:47] iteration 11240/ 11920 | consumed samples: 11509760 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793473E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:41.531992 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:50:53] iteration 11241/ 11920 | consumed samples: 11510784 | elapsed time per iteration (ms): 6017.2 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793389E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:05.690127 | finish at 2025-09-10 11:58:59 + [2025-09-10 10:50:58] iteration 11242/ 11920 | consumed samples: 11511808 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794657E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:29.316971 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:51:05] iteration 11243/ 11920 | consumed samples: 11512832 | elapsed time per iteration (ms): 6174.7 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787551E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:09:40.303390 | finish at 2025-09-10 12:00:45 + [2025-09-10 10:51:10] iteration 11244/ 11920 | consumed samples: 11513856 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802691E+00 | loss scale: 1.0 | grad norm: 0.144 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:20.428308 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:51:16] iteration 11245/ 11920 | consumed samples: 11514880 | elapsed time per iteration (ms): 5937.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795167E+00 | loss scale: 1.0 | grad norm: 0.126 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:47.542294 | finish at 2025-09-10 11:58:04 + [2025-09-10 10:51:22] iteration 11246/ 11920 | consumed samples: 11515904 | elapsed time per iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821601E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:12.350755 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:51:27] iteration 11247/ 11920 | consumed samples: 11516928 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803651E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:03:02.499804 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:51:33] iteration 11248/ 11920 | consumed samples: 11517952 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792266E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:56.087975 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:51:39] iteration 11249/ 11920 | consumed samples: 11518976 | elapsed time per iteration (ms): 5842.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803733E+00 | loss scale: 1.0 | grad norm: 0.258 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:05:20.068874 | finish at 2025-09-10 11:56:59 + [2025-09-10 10:51:45] iteration 11250/ 11920 | consumed samples: 11520000 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802095E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:45.711737 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:51:50] iteration 11251/ 11920 | consumed samples: 11521024 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808559E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:45.893796 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:51:56] iteration 11252/ 11920 | consumed samples: 11522048 | elapsed time per iteration (ms): 5960.1 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791914E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:06:21.328601 | finish at 2025-09-10 11:58:17 + 
[2025-09-10 10:52:02] iteration 11253/ 11920 | consumed samples: 11523072 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791230E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:34.444052 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:52:07] iteration 11254/ 11920 | consumed samples: 11524096 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817268E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:26.277788 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:52:13] iteration 11255/ 11920 | consumed samples: 11525120 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796970E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:17.374442 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:52:19] iteration 11256/ 11920 | consumed samples: 11526144 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805578E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:10.342838 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:52:24] iteration 11257/ 11920 | consumed samples: 11527168 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787203E+00 | loss scale: 
1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:07.247990 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:52:30] iteration 11258/ 11920 | consumed samples: 11528192 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802145E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:01.202889 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:52:35] iteration 11259/ 11920 | consumed samples: 11529216 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802512E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:00.670312 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:52:41] iteration 11260/ 11920 | consumed samples: 11530240 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804457E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:54.694333 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:52:47] iteration 11261/ 11920 | consumed samples: 11531264 | elapsed time per iteration (ms): 5829.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810556E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:04:01.280360 | finish at 2025-09-10 11:56:48 + [2025-09-10 10:52:53] iteration 11262/ 11920 | consumed samples: 11532288 | elapsed time per 
iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812943E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:41.886460 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:52:58] iteration 11263/ 11920 | consumed samples: 11533312 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795753E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:32.200984 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:53:04] iteration 11264/ 11920 | consumed samples: 11534336 | elapsed time per iteration (ms): 5624.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785904E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:29.501232 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:53:09] iteration 11265/ 11920 | consumed samples: 11535360 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787523E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:18.878802 | finish at 2025-09-10 11:54:28 + [2025-09-10 10:53:15] iteration 11266/ 11920 | consumed samples: 11536384 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799090E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 1:01:16.107994 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:53:21] iteration 11267/ 11920 | consumed samples: 11537408 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814085E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:08.442392 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:53:26] iteration 11268/ 11920 | consumed samples: 11538432 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789833E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:06.718247 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:53:33] iteration 11269/ 11920 | consumed samples: 11539456 | elapsed time per iteration (ms): 6296.9 | throughput per GPU (TFLOP/s/GPU): 71.7 | MFU 7.25% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806589E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:08:19.278973 | finish at 2025-09-10 12:01:52 + [2025-09-10 10:53:38] iteration 11270/ 11920 | consumed samples: 11540480 | elapsed time per iteration (ms): 5634.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803727E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:02.516451 | finish at 2025-09-10 11:54:41 + [2025-09-10 10:53:44] iteration 11271/ 11920 | consumed samples: 11541504 | elapsed time per iteration (ms): 5633.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791867E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:56.197733 | finish at 2025-09-10 11:54:40 + [2025-09-10 10:53:49] iteration 11272/ 11920 | consumed samples: 11542528 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811359E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:44.796221 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:53:55] iteration 11273/ 11920 | consumed samples: 11543552 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811901E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:40.860802 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:54:01] iteration 11274/ 11920 | consumed samples: 11544576 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796856E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:32.035005 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:54:06] iteration 11275/ 11920 | consumed samples: 11545600 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807698E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:23.812548 | finish at 2025-09-10 11:54:30 + 
[2025-09-10 10:54:12] iteration 11276/ 11920 | consumed samples: 11546624 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807139E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:20.185822 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:54:18] iteration 11277/ 11920 | consumed samples: 11547648 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810362E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:12.313312 | finish at 2025-09-10 11:54:30 + [2025-09-10 10:54:23] iteration 11278/ 11920 | consumed samples: 11548672 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821100E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:05.650127 | finish at 2025-09-10 11:54:29 + [2025-09-10 10:54:30] iteration 11279/ 11920 | consumed samples: 11549696 | elapsed time per iteration (ms): 6331.1 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802593E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:07:38.209723 | finish at 2025-09-10 12:02:08 + [2025-09-10 10:54:35] iteration 11280/ 11920 | consumed samples: 11550720 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800279E+00 | loss scale: 
1.0 | grad norm: 0.225 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:59.340973 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:54:41] iteration 11281/ 11920 | consumed samples: 11551744 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813748E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:54.295474 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:54:46] iteration 11282/ 11920 | consumed samples: 11552768 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808867E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:44.772904 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:54:52] iteration 11283/ 11920 | consumed samples: 11553792 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796641E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:46.088187 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:54:58] iteration 11284/ 11920 | consumed samples: 11554816 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801068E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:43.699565 | finish at 2025-09-10 11:54:41 + [2025-09-10 10:55:03] iteration 11285/ 11920 | consumed samples: 11555840 | elapsed time per 
iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808162E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:33.384416 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:55:09] iteration 11286/ 11920 | consumed samples: 11556864 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796100E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:26.495328 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:55:15] iteration 11287/ 11920 | consumed samples: 11557888 | elapsed time per iteration (ms): 5971.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811750E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:02:59.800381 | finish at 2025-09-10 11:58:15 + [2025-09-10 10:55:21] iteration 11288/ 11920 | consumed samples: 11558912 | elapsed time per iteration (ms): 5876.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806678E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:54.049791 | finish at 2025-09-10 11:57:15 + [2025-09-10 10:55:26] iteration 11289/ 11920 | consumed samples: 11559936 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787991E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:59:10.596289 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:55:32] iteration 11290/ 11920 | consumed samples: 11560960 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813518E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:59.562321 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:55:38] iteration 11291/ 11920 | consumed samples: 11561984 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796689E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:55.031066 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:55:43] iteration 11292/ 11920 | consumed samples: 11563008 | elapsed time per iteration (ms): 5840.5 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793458E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:07.818951 | finish at 2025-09-10 11:56:51 + [2025-09-10 10:55:49] iteration 11293/ 11920 | consumed samples: 11564032 | elapsed time per iteration (ms): 5635.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.821038E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:53.587629 | finish at 2025-09-10 11:54:43 + [2025-09-10 10:55:55] iteration 11294/ 11920 | consumed samples: 11565056 | elapsed time per iteration (ms): 5629.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809118E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:43.857846 | finish at 2025-09-10 11:54:39 + [2025-09-10 10:56:00] iteration 11295/ 11920 | consumed samples: 11566080 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803172E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:35.406102 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:56:06] iteration 11296/ 11920 | consumed samples: 11567104 | elapsed time per iteration (ms): 5639.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816491E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:38.983967 | finish at 2025-09-10 11:54:45 + [2025-09-10 10:56:12] iteration 11297/ 11920 | consumed samples: 11568128 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796976E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:25.101484 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:56:17] iteration 11298/ 11920 | consumed samples: 11569152 | elapsed time per iteration (ms): 5843.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802901E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:00:34.424408 | finish at 2025-09-10 11:56:52 + 
[2025-09-10 10:56:23] iteration 11299/ 11920 | consumed samples: 11570176 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799677E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:17.303195 | finish at 2025-09-10 11:54:40 + [2025-09-10 10:56:29] iteration 11300/ 11920 | consumed samples: 11571200 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803266E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:07.185144 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:56:34] iteration 11301/ 11920 | consumed samples: 11572224 | elapsed time per iteration (ms): 5637.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791251E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:09.704325 | finish at 2025-09-10 11:54:44 + [2025-09-10 10:56:40] iteration 11302/ 11920 | consumed samples: 11573248 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811412E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:56.847917 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:56:46] iteration 11303/ 11920 | consumed samples: 11574272 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790813E+00 | loss scale: 
1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:49.949203 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:56:51] iteration 11304/ 11920 | consumed samples: 11575296 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808665E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:45.115730 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:56:57] iteration 11305/ 11920 | consumed samples: 11576320 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803650E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:39.934530 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:57:02] iteration 11306/ 11920 | consumed samples: 11577344 | elapsed time per iteration (ms): 5616.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789245E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:28.371814 | finish at 2025-09-10 11:54:31 + [2025-09-10 10:57:08] iteration 11307/ 11920 | consumed samples: 11578368 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812079E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:24.123397 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:57:14] iteration 11308/ 11920 | consumed samples: 11579392 | elapsed time per 
iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804379E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:21.701277 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:57:19] iteration 11309/ 11920 | consumed samples: 11580416 | elapsed time per iteration (ms): 5638.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797617E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:25.195448 | finish at 2025-09-10 11:54:45 + [2025-09-10 10:57:25] iteration 11310/ 11920 | consumed samples: 11581440 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808979E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:08.538649 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:57:31] iteration 11311/ 11920 | consumed samples: 11582464 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804987E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:03.707965 | finish at 2025-09-10 11:54:34 + [2025-09-10 10:57:36] iteration 11312/ 11920 | consumed samples: 11583488 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799813E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:57:00.430382 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:57:42] iteration 11313/ 11920 | consumed samples: 11584512 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814317E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:56:56.545946 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:57:48] iteration 11314/ 11920 | consumed samples: 11585536 | elapsed time per iteration (ms): 6040.4 | throughput per GPU (TFLOP/s/GPU): 74.7 | MFU 7.56% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802154E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:00.490075 | finish at 2025-09-10 11:58:48 + [2025-09-10 10:57:54] iteration 11315/ 11920 | consumed samples: 11586560 | elapsed time per iteration (ms): 5866.7 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791132E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:09.335430 | finish at 2025-09-10 11:57:03 + [2025-09-10 10:58:00] iteration 11316/ 11920 | consumed samples: 11587584 | elapsed time per iteration (ms): 5875.3 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797627E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:59:08.685190 | finish at 2025-09-10 11:57:08 + [2025-09-10 10:58:05] iteration 11317/ 11920 | consumed samples: 11588608 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785568E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:56:30.220968 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:58:11] iteration 11318/ 11920 | consumed samples: 11589632 | elapsed time per iteration (ms): 5863.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787991E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:49.500258 | finish at 2025-09-10 11:57:01 + [2025-09-10 10:58:17] iteration 11319/ 11920 | consumed samples: 11590656 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804009E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:56:20.928057 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:58:23] iteration 11320/ 11920 | consumed samples: 11591680 | elapsed time per iteration (ms): 5840.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803575E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:58:24.062319 | finish at 2025-09-10 11:56:47 + [2025-09-10 10:58:28] iteration 11321/ 11920 | consumed samples: 11592704 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805071E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:56:04.965229 | finish at 2025-09-10 11:54:33 + 
[2025-09-10 10:58:34] iteration 11322/ 11920 | consumed samples: 11593728 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789103E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:59.012113 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:58:39] iteration 11323/ 11920 | consumed samples: 11594752 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801894E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:58.361989 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:58:45] iteration 11324/ 11920 | consumed samples: 11595776 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815366E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:51.301408 | finish at 2025-09-10 11:54:36 + [2025-09-10 10:58:51] iteration 11325/ 11920 | consumed samples: 11596800 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808751E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:59.040691 | finish at 2025-09-10 11:54:50 + [2025-09-10 10:58:56] iteration 11326/ 11920 | consumed samples: 11597824 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792433E+00 | loss scale: 
1.0 | grad norm: 0.212 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:41.001031 | finish at 2025-09-10 11:54:37 + [2025-09-10 10:59:02] iteration 11327/ 11920 | consumed samples: 11598848 | elapsed time per iteration (ms): 5617.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806961E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:30.927576 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:59:08] iteration 11328/ 11920 | consumed samples: 11599872 | elapsed time per iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808611E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:25.362438 | finish at 2025-09-10 11:54:33 + [2025-09-10 10:59:13] iteration 11329/ 11920 | consumed samples: 11600896 | elapsed time per iteration (ms): 5614.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793435E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:18.347772 | finish at 2025-09-10 11:54:32 + [2025-09-10 10:59:19] iteration 11330/ 11920 | consumed samples: 11601920 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792010E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:20.426750 | finish at 2025-09-10 11:54:39 + [2025-09-10 10:59:24] iteration 11331/ 11920 | consumed samples: 11602944 | elapsed time per 
iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797509E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:10.436917 | finish at 2025-09-10 11:54:35 + [2025-09-10 10:59:30] iteration 11332/ 11920 | consumed samples: 11603968 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790676E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:07.493691 | finish at 2025-09-10 11:54:38 + [2025-09-10 10:59:36] iteration 11333/ 11920 | consumed samples: 11604992 | elapsed time per iteration (ms): 6244.9 | throughput per GPU (TFLOP/s/GPU): 72.3 | MFU 7.31% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790854E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 1:01:05.733761 | finish at 2025-09-10 12:00:42 + [2025-09-10 10:59:42] iteration 11334/ 11920 | consumed samples: 11606016 | elapsed time per iteration (ms): 5626.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794964E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:56.887232 | finish at 2025-09-10 11:54:39 + [2025-09-10 10:59:48] iteration 11335/ 11920 | consumed samples: 11607040 | elapsed time per iteration (ms): 5634.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808683E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:54:56.004825 | finish at 2025-09-10 11:54:44 + [2025-09-10 10:59:53] iteration 11336/ 11920 | consumed samples: 11608064 | elapsed time per iteration (ms): 5858.7 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798653E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:01.468987 | finish at 2025-09-10 11:56:55 + [2025-09-10 10:59:59] iteration 11337/ 11920 | consumed samples: 11609088 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790165E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:41.312355 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:00:05] iteration 11338/ 11920 | consumed samples: 11610112 | elapsed time per iteration (ms): 5627.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798923E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:34.948606 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:00:10] iteration 11339/ 11920 | consumed samples: 11611136 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806491E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:25.124077 | finish at 2025-09-10 11:54:35 + [2025-09-10 11:00:16] iteration 11340/ 11920 | consumed samples: 11612160 | elapsed time per iteration (ms): 5628.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784238E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:24.430704 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:00:22] iteration 11341/ 11920 | consumed samples: 11613184 | elapsed time per iteration (ms): 5910.5 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789466E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:02.182842 | finish at 2025-09-10 11:57:24 + [2025-09-10 11:00:27] iteration 11342/ 11920 | consumed samples: 11614208 | elapsed time per iteration (ms): 5626.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795101E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:12.265216 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:00:33] iteration 11343/ 11920 | consumed samples: 11615232 | elapsed time per iteration (ms): 5993.9 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816240E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:57:38.486938 | finish at 2025-09-10 11:58:12 + [2025-09-10 11:00:39] iteration 11344/ 11920 | consumed samples: 11616256 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795076E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:56.358444 | finish at 2025-09-10 11:54:35 + 
[2025-09-10 11:00:45] iteration 11345/ 11920 | consumed samples: 11617280 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817797E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:51.666499 | finish at 2025-09-10 11:54:36 + [2025-09-10 11:00:50] iteration 11346/ 11920 | consumed samples: 11618304 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797135E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:51.204719 | finish at 2025-09-10 11:54:42 + [2025-09-10 11:00:56] iteration 11347/ 11920 | consumed samples: 11619328 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809241E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:43.398774 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:01:02] iteration 11348/ 11920 | consumed samples: 11620352 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791630E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:33.660759 | finish at 2025-09-10 11:54:35 + [2025-09-10 11:01:07] iteration 11349/ 11920 | consumed samples: 11621376 | elapsed time per iteration (ms): 5831.5 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814073E+00 | loss scale: 
1.0 | grad norm: 0.223 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:29.783944 | finish at 2025-09-10 11:56:37 + [2025-09-10 11:01:13] iteration 11350/ 11920 | consumed samples: 11622400 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807848E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:27.345479 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:01:19] iteration 11351/ 11920 | consumed samples: 11623424 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800088E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:17.991836 | finish at 2025-09-10 11:54:37 + [2025-09-10 11:01:24] iteration 11352/ 11920 | consumed samples: 11624448 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790477E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:11.650885 | finish at 2025-09-10 11:54:36 + [2025-09-10 11:01:30] iteration 11353/ 11920 | consumed samples: 11625472 | elapsed time per iteration (ms): 5621.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810650E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:07.353604 | finish at 2025-09-10 11:54:37 + [2025-09-10 11:01:36] iteration 11354/ 11920 | consumed samples: 11626496 | elapsed time per 
iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804578E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:00.917911 | finish at 2025-09-10 11:54:36 + [2025-09-10 11:01:41] iteration 11355/ 11920 | consumed samples: 11627520 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809967E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:00.183719 | finish at 2025-09-10 11:54:41 + [2025-09-10 11:01:47] iteration 11356/ 11920 | consumed samples: 11628544 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791660E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:54.903348 | finish at 2025-09-10 11:54:42 + [2025-09-10 11:01:53] iteration 11357/ 11920 | consumed samples: 11629568 | elapsed time per iteration (ms): 5878.2 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799286E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:55:09.406496 | finish at 2025-09-10 11:57:02 + [2025-09-10 11:01:58] iteration 11358/ 11920 | consumed samples: 11630592 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808081E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:52:41.712538 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:02:04] iteration 11359/ 11920 | consumed samples: 11631616 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807044E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:30.092047 | finish at 2025-09-10 11:54:34 + [2025-09-10 11:02:10] iteration 11360/ 11920 | consumed samples: 11632640 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811892E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:28.462048 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:02:15] iteration 11361/ 11920 | consumed samples: 11633664 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806447E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:23.679299 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:02:21] iteration 11362/ 11920 | consumed samples: 11634688 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809589E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:17.145035 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:02:26] iteration 11363/ 11920 | consumed samples: 11635712 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802556E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:10.127458 | finish at 2025-09-10 11:54:37 + [2025-09-10 11:02:32] iteration 11364/ 11920 | consumed samples: 11636736 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802697E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:06.696549 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:02:38] iteration 11365/ 11920 | consumed samples: 11637760 | elapsed time per iteration (ms): 5830.7 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797120E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:56.031708 | finish at 2025-09-10 11:56:34 + [2025-09-10 11:02:43] iteration 11366/ 11920 | consumed samples: 11638784 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798997E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:54.720601 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:02:49] iteration 11367/ 11920 | consumed samples: 11639808 | elapsed time per iteration (ms): 5636.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795661E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:57.086615 | finish at 2025-09-10 11:54:46 + 
[2025-09-10 11:02:55] iteration 11368/ 11920 | consumed samples: 11640832 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808681E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:56.260563 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:03:00] iteration 11369/ 11920 | consumed samples: 11641856 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794384E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:40.017655 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:03:06] iteration 11370/ 11920 | consumed samples: 11642880 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790083E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:33.428338 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:03:12] iteration 11371/ 11920 | consumed samples: 11643904 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811118E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:26.095654 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:03:17] iteration 11372/ 11920 | consumed samples: 11644928 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802505E+00 | loss scale: 
1.0 | grad norm: 0.178 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:21.207838 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:03:23] iteration 11373/ 11920 | consumed samples: 11645952 | elapsed time per iteration (ms): 5969.9 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790030E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:54:25.526112 | finish at 2025-09-10 11:57:49 + [2025-09-10 11:03:29] iteration 11374/ 11920 | consumed samples: 11646976 | elapsed time per iteration (ms): 5916.3 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801171E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:53:50.301264 | finish at 2025-09-10 11:57:19 + [2025-09-10 11:03:35] iteration 11375/ 11920 | consumed samples: 11648000 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812696E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:07.485584 | finish at 2025-09-10 11:54:42 + [2025-09-10 11:03:40] iteration 11376/ 11920 | consumed samples: 11649024 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793932E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:02.580116 | finish at 2025-09-10 11:54:43 + [2025-09-10 11:03:46] iteration 11377/ 11920 | consumed samples: 11650048 | elapsed time per 
iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791651E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:53.216063 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:03:52] iteration 11378/ 11920 | consumed samples: 11651072 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804701E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:52.805918 | finish at 2025-09-10 11:54:44 + [2025-09-10 11:03:57] iteration 11379/ 11920 | consumed samples: 11652096 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796261E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:44.131337 | finish at 2025-09-10 11:54:41 + [2025-09-10 11:04:03] iteration 11380/ 11920 | consumed samples: 11653120 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790944E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:35.151544 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:04:09] iteration 11381/ 11920 | consumed samples: 11654144 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805099E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:50:29.513416 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:04:14] iteration 11382/ 11920 | consumed samples: 11655168 | elapsed time per iteration (ms): 5614.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808413E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:20.417343 | finish at 2025-09-10 11:54:35 + [2025-09-10 11:04:20] iteration 11383/ 11920 | consumed samples: 11656192 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786635E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:19.668482 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:04:25] iteration 11384/ 11920 | consumed samples: 11657216 | elapsed time per iteration (ms): 5621.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805896E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:13.145094 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:04:31] iteration 11385/ 11920 | consumed samples: 11658240 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803017E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:07.199695 | finish at 2025-09-10 11:54:38 + [2025-09-10 11:04:37] iteration 11386/ 11920 | consumed samples: 11659264 | elapsed time per iteration (ms): 5852.6 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786511E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:05.307168 | finish at 2025-09-10 11:56:42 + [2025-09-10 11:04:43] iteration 11387/ 11920 | consumed samples: 11660288 | elapsed time per iteration (ms): 5955.4 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.780968E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:54.202398 | finish at 2025-09-10 11:57:37 + [2025-09-10 11:04:48] iteration 11388/ 11920 | consumed samples: 11661312 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808032E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:55.394205 | finish at 2025-09-10 11:54:44 + [2025-09-10 11:04:54] iteration 11389/ 11920 | consumed samples: 11662336 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802271E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:49.413589 | finish at 2025-09-10 11:54:43 + [2025-09-10 11:05:00] iteration 11390/ 11920 | consumed samples: 11663360 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808956E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:41.089900 | finish at 2025-09-10 11:54:41 + 
[2025-09-10 11:05:06] iteration 11391/ 11920 | consumed samples: 11664384 | elapsed time per iteration (ms): 5921.8 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.816295E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:12.613858 | finish at 2025-09-10 11:57:18 + [2025-09-10 11:05:11] iteration 11392/ 11920 | consumed samples: 11665408 | elapsed time per iteration (ms): 5856.3 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.817520E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:32.130489 | finish at 2025-09-10 11:56:44 + [2025-09-10 11:05:17] iteration 11393/ 11920 | consumed samples: 11666432 | elapsed time per iteration (ms): 5978.7 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802625E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:30.758902 | finish at 2025-09-10 11:57:48 + [2025-09-10 11:05:23] iteration 11394/ 11920 | consumed samples: 11667456 | elapsed time per iteration (ms): 5948.9 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793417E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:52:09.112864 | finish at 2025-09-10 11:57:33 + [2025-09-10 11:05:29] iteration 11395/ 11920 | consumed samples: 11668480 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798812E+00 | loss scale: 
1.0 | grad norm: 0.179 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:10.033307 | finish at 2025-09-10 11:54:39 + [2025-09-10 11:05:35] iteration 11396/ 11920 | consumed samples: 11669504 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814421E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:05.564689 | finish at 2025-09-10 11:54:40 + [2025-09-10 11:05:40] iteration 11397/ 11920 | consumed samples: 11670528 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.822705E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:00.535299 | finish at 2025-09-10 11:54:41 + [2025-09-10 11:05:46] iteration 11398/ 11920 | consumed samples: 11671552 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804523E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:54.931653 | finish at 2025-09-10 11:54:41 + [2025-09-10 11:05:52] iteration 11399/ 11920 | consumed samples: 11672576 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806980E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:55.764068 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:05:57] iteration 11400/ 11920 | consumed samples: 11673600 | elapsed time 
per iteration (ms): 5884.3 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801533E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:59.857569 | finish at 2025-09-10 11:56:57 + [2025-09-10 11:06:03] iteration 11401/ 11920 | consumed samples: 11674624 | elapsed time per iteration (ms): 5628.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808386E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:40.965668 | finish at 2025-09-10 11:54:44 + [2025-09-10 11:06:09] iteration 11402/ 11920 | consumed samples: 11675648 | elapsed time per iteration (ms): 5639.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784513E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:41.411991 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:06:15] iteration 11403/ 11920 | consumed samples: 11676672 | elapsed time per iteration (ms): 5875.7 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800632E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:37.740473 | finish at 2025-09-10 11:56:52 + [2025-09-10 11:06:20] iteration 11404/ 11920 | consumed samples: 11677696 | elapsed time per iteration (ms): 5638.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806252E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:48:29.276161 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:06:26] iteration 11405/ 11920 | consumed samples: 11678720 | elapsed time per iteration (ms): 6002.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.61% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801578E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:31.339713 | finish at 2025-09-10 11:57:58 + [2025-09-10 11:06:32] iteration 11406/ 11920 | consumed samples: 11679744 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807313E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:11.491786 | finish at 2025-09-10 11:54:43 + [2025-09-10 11:06:37] iteration 11407/ 11920 | consumed samples: 11680768 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802328E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:05.111793 | finish at 2025-09-10 11:54:43 + [2025-09-10 11:06:43] iteration 11408/ 11920 | consumed samples: 11681792 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805263E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:57.846802 | finish at 2025-09-10 11:54:41 + [2025-09-10 11:06:49] iteration 11409/ 11920 | consumed samples: 11682816 | elapsed time per iteration (ms): 5633.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815997E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:58.441139 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:06:54] iteration 11410/ 11920 | consumed samples: 11683840 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791891E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:55.615168 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:07:00] iteration 11411/ 11920 | consumed samples: 11684864 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789039E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:44.976029 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:07:06] iteration 11412/ 11920 | consumed samples: 11685888 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812313E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:41.238023 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:07:11] iteration 11413/ 11920 | consumed samples: 11686912 | elapsed time per iteration (ms): 5634.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794060E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:36.847326 | finish at 2025-09-10 11:54:48 + 
[2025-09-10 11:07:17] iteration 11414/ 11920 | consumed samples: 11687936 | elapsed time per iteration (ms): 6007.5 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803255E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:39.793157 | finish at 2025-09-10 11:57:57 + [2025-09-10 11:07:23] iteration 11415/ 11920 | consumed samples: 11688960 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808212E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:25.168226 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:07:29] iteration 11416/ 11920 | consumed samples: 11689984 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795337E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:16.096247 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:07:34] iteration 11417/ 11920 | consumed samples: 11691008 | elapsed time per iteration (ms): 5635.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.824048E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:14.840921 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:07:40] iteration 11418/ 11920 | consumed samples: 11692032 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799723E+00 | loss scale: 
1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:04.233293 | finish at 2025-09-10 11:54:44 + [2025-09-10 11:07:45] iteration 11419/ 11920 | consumed samples: 11693056 | elapsed time per iteration (ms): 5631.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801458E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:01.235538 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:07:51] iteration 11420/ 11920 | consumed samples: 11694080 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809371E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:46:55.385699 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:07:57] iteration 11421/ 11920 | consumed samples: 11695104 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792785E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:46:53.354392 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:08:03] iteration 11422/ 11920 | consumed samples: 11696128 | elapsed time per iteration (ms): 5966.9 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796546E+00 | loss scale: 1.0 | grad norm: 0.233 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:49:31.537864 | finish at 2025-09-10 11:57:34 + [2025-09-10 11:08:08] iteration 11423/ 11920 | consumed samples: 11697152 | elapsed time per 
iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805110E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:46:42.295029 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:08:14] iteration 11424/ 11920 | consumed samples: 11698176 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807378E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:46:28.393379 | finish at 2025-09-10 11:54:42 + [2025-09-10 11:08:20] iteration 11425/ 11920 | consumed samples: 11699200 | elapsed time per iteration (ms): 6066.5 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802171E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:02.926916 | finish at 2025-09-10 11:58:23 + [2025-09-10 11:08:26] iteration 11426/ 11920 | consumed samples: 11700224 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803116E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:46:20.734690 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:08:31] iteration 11427/ 11920 | consumed samples: 11701248 | elapsed time per iteration (ms): 5858.9 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798013E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:48:08.417283 | finish at 2025-09-10 11:56:40 + [2025-09-10 11:08:38] iteration 11428/ 11920 | consumed samples: 11702272 | elapsed time per iteration (ms): 6331.5 | throughput per GPU (TFLOP/s/GPU): 71.3 | MFU 7.21% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802433E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:51:55.091105 | finish at 2025-09-10 12:00:33 + [2025-09-10 11:08:44] iteration 11429/ 11920 | consumed samples: 11703296 | elapsed time per iteration (ms): 5868.2 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805344E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:48:01.290094 | finish at 2025-09-10 11:56:45 + [2025-09-10 11:08:49] iteration 11430/ 11920 | consumed samples: 11704320 | elapsed time per iteration (ms): 5632.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802092E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:59.902654 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:08:55] iteration 11431/ 11920 | consumed samples: 11705344 | elapsed time per iteration (ms): 5641.5 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808669E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:58.704457 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:09:01] iteration 11432/ 11920 | consumed samples: 11706368 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805071E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:43.703299 | finish at 2025-09-10 11:54:44 + [2025-09-10 11:09:06] iteration 11433/ 11920 | consumed samples: 11707392 | elapsed time per iteration (ms): 5618.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803211E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:35.968569 | finish at 2025-09-10 11:54:42 + [2025-09-10 11:09:12] iteration 11434/ 11920 | consumed samples: 11708416 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804856E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:33.899011 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:09:17] iteration 11435/ 11920 | consumed samples: 11709440 | elapsed time per iteration (ms): 5623.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810566E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:27.225606 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:09:23] iteration 11436/ 11920 | consumed samples: 11710464 | elapsed time per iteration (ms): 5619.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802742E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:19.798497 | finish at 2025-09-10 11:54:43 + 
[2025-09-10 11:09:29] iteration 11437/ 11920 | consumed samples: 11711488 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799696E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:12.942071 | finish at 2025-09-10 11:54:42 + [2025-09-10 11:09:35] iteration 11438/ 11920 | consumed samples: 11712512 | elapsed time per iteration (ms): 6237.4 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.32% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788872E+00 | loss scale: 1.0 | grad norm: 0.213 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:50:06.403153 | finish at 2025-09-10 11:59:41 + [2025-09-10 11:09:41] iteration 11439/ 11920 | consumed samples: 11713536 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807046E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:06.604132 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:09:46] iteration 11440/ 11920 | consumed samples: 11714560 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807314E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:02.491608 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:09:52] iteration 11441/ 11920 | consumed samples: 11715584 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793051E+00 | loss scale: 
1.0 | grad norm: 0.207 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:56.641577 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:09:57] iteration 11442/ 11920 | consumed samples: 11716608 | elapsed time per iteration (ms): 5627.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808791E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:49.916536 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:10:03] iteration 11443/ 11920 | consumed samples: 11717632 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807138E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:41.923716 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:10:09] iteration 11444/ 11920 | consumed samples: 11718656 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792519E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:39.350977 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:10:14] iteration 11445/ 11920 | consumed samples: 11719680 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791903E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:31.978849 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:10:20] iteration 11446/ 11920 | consumed samples: 11720704 | elapsed time per 
iteration (ms): 5982.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797557E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:15.725265 | finish at 2025-09-10 11:57:36 + [2025-09-10 11:10:26] iteration 11447/ 11920 | consumed samples: 11721728 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800128E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:18.102065 | finish at 2025-09-10 11:54:44 + [2025-09-10 11:10:32] iteration 11448/ 11920 | consumed samples: 11722752 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791627E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:13.674017 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:10:38] iteration 11449/ 11920 | consumed samples: 11723776 | elapsed time per iteration (ms): 5993.8 | throughput per GPU (TFLOP/s/GPU): 75.3 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798022E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:47:03.062471 | finish at 2025-09-10 11:57:41 + [2025-09-10 11:10:43] iteration 11450/ 11920 | consumed samples: 11724800 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802360E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:44:02.504153 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:10:49] iteration 11451/ 11920 | consumed samples: 11725824 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.782991E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:59.176092 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:10:54] iteration 11452/ 11920 | consumed samples: 11726848 | elapsed time per iteration (ms): 5630.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795485E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:55.041121 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:11:00] iteration 11453/ 11920 | consumed samples: 11727872 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786632E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:46.897157 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:11:06] iteration 11454/ 11920 | consumed samples: 11728896 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787697E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:43.591386 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:11:11] iteration 11455/ 11920 | consumed samples: 11729920 | elapsed time per iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790707E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:34.249946 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:11:17] iteration 11456/ 11920 | consumed samples: 11730944 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795580E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:28.133293 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:11:23] iteration 11457/ 11920 | consumed samples: 11731968 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795053E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:23.824827 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:11:28] iteration 11458/ 11920 | consumed samples: 11732992 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790348E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:20.463814 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:11:34] iteration 11459/ 11920 | consumed samples: 11734016 | elapsed time per iteration (ms): 5976.5 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.772084E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:45:55.151018 | finish at 2025-09-10 11:57:29 + 
[2025-09-10 11:11:40] iteration 11460/ 11920 | consumed samples: 11735040 | elapsed time per iteration (ms): 5829.9 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797301E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:41.751490 | finish at 2025-09-10 11:56:22 + [2025-09-10 11:11:46] iteration 11461/ 11920 | consumed samples: 11736064 | elapsed time per iteration (ms): 5629.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787472E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:04.119385 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:11:51] iteration 11462/ 11920 | consumed samples: 11737088 | elapsed time per iteration (ms): 5638.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789822E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:02.295620 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:11:57] iteration 11463/ 11920 | consumed samples: 11738112 | elapsed time per iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789258E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:55.392644 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:12:03] iteration 11464/ 11920 | consumed samples: 11739136 | elapsed time per iteration (ms): 5830.0 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790114E+00 | loss scale: 
1.0 | grad norm: 0.161 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:18.500622 | finish at 2025-09-10 11:56:21 + [2025-09-10 11:12:08] iteration 11465/ 11920 | consumed samples: 11740160 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783812E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:36.928983 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:12:14] iteration 11466/ 11920 | consumed samples: 11741184 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794356E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:32.458889 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:12:20] iteration 11467/ 11920 | consumed samples: 11742208 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789278E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:28.013216 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:12:25] iteration 11468/ 11920 | consumed samples: 11743232 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801844E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:21.549403 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:12:31] iteration 11469/ 11920 | consumed samples: 11744256 | elapsed time per 
iteration (ms): 5622.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794296E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:15.499948 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:12:36] iteration 11470/ 11920 | consumed samples: 11745280 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795570E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:09.822421 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:12:42] iteration 11471/ 11920 | consumed samples: 11746304 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785081E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:09.850796 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:12:48] iteration 11472/ 11920 | consumed samples: 11747328 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800232E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:58.889053 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:12:54] iteration 11473/ 11920 | consumed samples: 11748352 | elapsed time per iteration (ms): 5983.0 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795105E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:44:34.394091 | finish at 2025-09-10 11:57:28 + [2025-09-10 11:12:59] iteration 11474/ 11920 | consumed samples: 11749376 | elapsed time per iteration (ms): 5632.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790642E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:52.209599 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:13:05] iteration 11475/ 11920 | consumed samples: 11750400 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796165E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:42.357287 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:13:11] iteration 11476/ 11920 | consumed samples: 11751424 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795375E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:36.626567 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:13:16] iteration 11477/ 11920 | consumed samples: 11752448 | elapsed time per iteration (ms): 5940.2 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803923E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:43:51.510964 | finish at 2025-09-10 11:57:08 + [2025-09-10 11:13:22] iteration 11478/ 11920 | consumed samples: 11753472 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795614E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:24.409310 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:13:28] iteration 11479/ 11920 | consumed samples: 11754496 | elapsed time per iteration (ms): 6020.5 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802656E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:44:15.022707 | finish at 2025-09-10 11:57:43 + [2025-09-10 11:13:34] iteration 11480/ 11920 | consumed samples: 11755520 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810049E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:13.837032 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:13:39] iteration 11481/ 11920 | consumed samples: 11756544 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800141E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:08.496854 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:13:45] iteration 11482/ 11920 | consumed samples: 11757568 | elapsed time per iteration (ms): 5628.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802549E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:05.225558 | finish at 2025-09-10 11:54:50 + 
[2025-09-10 11:13:51] iteration 11483/ 11920 | consumed samples: 11758592 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789911E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:00.648456 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:13:56] iteration 11484/ 11920 | consumed samples: 11759616 | elapsed time per iteration (ms): 5631.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783179E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:55.325790 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:14:02] iteration 11485/ 11920 | consumed samples: 11760640 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798227E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:46.570812 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:14:07] iteration 11486/ 11920 | consumed samples: 11761664 | elapsed time per iteration (ms): 5624.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796862E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:40.995455 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:14:13] iteration 11487/ 11920 | consumed samples: 11762688 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795932E+00 | loss scale: 
1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:32.562836 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:14:19] iteration 11488/ 11920 | consumed samples: 11763712 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783563E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:30.285301 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:14:24] iteration 11489/ 11920 | consumed samples: 11764736 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802832E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:23.366426 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:14:30] iteration 11490/ 11920 | consumed samples: 11765760 | elapsed time per iteration (ms): 5900.5 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791610E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:17.218306 | finish at 2025-09-10 11:56:47 + [2025-09-10 11:14:36] iteration 11491/ 11920 | consumed samples: 11766784 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793400E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:09.859252 | finish at 2025-09-10 11:54:46 + [2025-09-10 11:14:42] iteration 11492/ 11920 | consumed samples: 11767808 | elapsed time per 
iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809593E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:08.179097 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:14:47] iteration 11493/ 11920 | consumed samples: 11768832 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805407E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:00.057378 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:14:53] iteration 11494/ 11920 | consumed samples: 11769856 | elapsed time per iteration (ms): 5856.1 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805313E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:34.700362 | finish at 2025-09-10 11:56:28 + [2025-09-10 11:14:59] iteration 11495/ 11920 | consumed samples: 11770880 | elapsed time per iteration (ms): 5645.4 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797055E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:59.302518 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:15:04] iteration 11496/ 11920 | consumed samples: 11771904 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797213E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:39:47.186464 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:15:10] iteration 11497/ 11920 | consumed samples: 11772928 | elapsed time per iteration (ms): 5616.0 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803002E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:35.554157 | finish at 2025-09-10 11:54:45 + [2025-09-10 11:15:15] iteration 11498/ 11920 | consumed samples: 11773952 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801092E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:32.836538 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:15:21] iteration 11499/ 11920 | consumed samples: 11774976 | elapsed time per iteration (ms): 5909.5 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785899E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:41:27.905674 | finish at 2025-09-10 11:56:49 + [2025-09-10 11:15:27] iteration 11500/ 11920 | consumed samples: 11776000 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797986E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:21.370468 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:15:33] iteration 11501/ 11920 | consumed samples: 11777024 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790051E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:16.456530 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:15:39] iteration 11502/ 11920 | consumed samples: 11778048 | elapsed time per iteration (ms): 5858.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788752E+00 | loss scale: 1.0 | grad norm: 0.141 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:48.987641 | finish at 2025-09-10 11:56:28 + [2025-09-10 11:15:44] iteration 11503/ 11920 | consumed samples: 11779072 | elapsed time per iteration (ms): 5889.7 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801103E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:56.001791 | finish at 2025-09-10 11:56:40 + [2025-09-10 11:15:50] iteration 11504/ 11920 | consumed samples: 11780096 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806805E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:00.265709 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:15:56] iteration 11505/ 11920 | consumed samples: 11781120 | elapsed time per iteration (ms): 5847.6 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799743E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:40:26.734996 | finish at 2025-09-10 11:56:23 + 
[2025-09-10 11:16:02] iteration 11506/ 11920 | consumed samples: 11782144 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806401E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:49.463837 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:16:08] iteration 11507/ 11920 | consumed samples: 11783168 | elapsed time per iteration (ms): 6232.2 | throughput per GPU (TFLOP/s/GPU): 72.4 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796293E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:42:53.895706 | finish at 2025-09-10 11:59:02 + [2025-09-10 11:16:13] iteration 11508/ 11920 | consumed samples: 11784192 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800146E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:37.407076 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:16:19] iteration 11509/ 11920 | consumed samples: 11785216 | elapsed time per iteration (ms): 5616.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796867E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:28.430944 | finish at 2025-09-10 11:54:47 + [2025-09-10 11:16:25] iteration 11510/ 11920 | consumed samples: 11786240 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789030E+00 | loss scale: 
1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:25.758407 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:16:30] iteration 11511/ 11920 | consumed samples: 11787264 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809954E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:19.513447 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:16:36] iteration 11512/ 11920 | consumed samples: 11788288 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793228E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:18.039448 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:16:41] iteration 11513/ 11920 | consumed samples: 11789312 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789292E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:09.303290 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:16:47] iteration 11514/ 11920 | consumed samples: 11790336 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797913E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:03.182280 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:16:53] iteration 11515/ 11920 | consumed samples: 11791360 | elapsed time per 
iteration (ms): 5647.6 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800957E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:07.292554 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:16:58] iteration 11516/ 11920 | consumed samples: 11792384 | elapsed time per iteration (ms): 5636.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808967E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:57.058493 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:17:04] iteration 11517/ 11920 | consumed samples: 11793408 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790811E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:46.992797 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:17:10] iteration 11518/ 11920 | consumed samples: 11794432 | elapsed time per iteration (ms): 5854.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813104E+00 | loss scale: 1.0 | grad norm: 0.251 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:13.626623 | finish at 2025-09-10 11:56:23 + [2025-09-10 11:17:15] iteration 11519/ 11920 | consumed samples: 11795456 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793631E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:37:33.806864 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:17:21] iteration 11520/ 11920 | consumed samples: 11796480 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812388E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:27.340202 | finish at 2025-09-10 11:54:48 + [2025-09-10 11:17:27] iteration 11521/ 11920 | consumed samples: 11797504 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805943E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:25.558595 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:17:32] iteration 11522/ 11920 | consumed samples: 11798528 | elapsed time per iteration (ms): 5630.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803807E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:21.036674 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:17:38] iteration 11523/ 11920 | consumed samples: 11799552 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794627E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:11.328596 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:17:44] iteration 11524/ 11920 | consumed samples: 11800576 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798419E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:05.218869 | finish at 2025-09-10 11:54:49 + [2025-09-10 11:17:49] iteration 11525/ 11920 | consumed samples: 11801600 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.827025E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:03.057277 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:17:55] iteration 11526/ 11920 | consumed samples: 11802624 | elapsed time per iteration (ms): 5989.0 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802930E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:19.657014 | finish at 2025-09-10 11:57:15 + [2025-09-10 11:18:01] iteration 11527/ 11920 | consumed samples: 11803648 | elapsed time per iteration (ms): 6073.1 | throughput per GPU (TFLOP/s/GPU): 74.3 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798075E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:39:46.746232 | finish at 2025-09-10 11:57:48 + [2025-09-10 11:18:07] iteration 11528/ 11920 | consumed samples: 11804672 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798674E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:45.348326 | finish at 2025-09-10 11:54:52 + 
[2025-09-10 11:18:13] iteration 11529/ 11920 | consumed samples: 11805696 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797541E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:37.050704 | finish at 2025-09-10 11:54:50 + [2025-09-10 11:18:18] iteration 11530/ 11920 | consumed samples: 11806720 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802111E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:33.793051 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:18:24] iteration 11531/ 11920 | consumed samples: 11807744 | elapsed time per iteration (ms): 5970.2 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791020E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:42.426837 | finish at 2025-09-10 11:57:07 + [2025-09-10 11:18:30] iteration 11532/ 11920 | consumed samples: 11808768 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792872E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:20.783636 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:18:36] iteration 11533/ 11920 | consumed samples: 11809792 | elapsed time per iteration (ms): 5946.3 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806729E+00 | loss scale: 
1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:21.200581 | finish at 2025-09-10 11:56:57 + [2025-09-10 11:18:41] iteration 11534/ 11920 | consumed samples: 11810816 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803068E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:10.448146 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:18:47] iteration 11535/ 11920 | consumed samples: 11811840 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796010E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:07.438151 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:18:53] iteration 11536/ 11920 | consumed samples: 11812864 | elapsed time per iteration (ms): 5947.7 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798122E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:03.910400 | finish at 2025-09-10 11:56:57 + [2025-09-10 11:18:59] iteration 11537/ 11920 | consumed samples: 11813888 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810323E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:57.815084 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:19:04] iteration 11538/ 11920 | consumed samples: 11814912 | elapsed time per 
iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785767E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:51.015604 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:19:10] iteration 11539/ 11920 | consumed samples: 11815936 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802299E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:42.766010 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:19:15] iteration 11540/ 11920 | consumed samples: 11816960 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809779E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:35.528111 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:19:21] iteration 11541/ 11920 | consumed samples: 11817984 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788405E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:30.139262 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:19:27] iteration 11542/ 11920 | consumed samples: 11819008 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.779901E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:35:24.642941 | finish at 2025-09-10 11:54:51 + [2025-09-10 11:19:32] iteration 11543/ 11920 | consumed samples: 11820032 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786896E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:23.906568 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:19:38] iteration 11544/ 11920 | consumed samples: 11821056 | elapsed time per iteration (ms): 5961.5 | throughput per GPU (TFLOP/s/GPU): 75.7 | MFU 7.66% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796659E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:37:21.534821 | finish at 2025-09-10 11:57:00 + [2025-09-10 11:19:44] iteration 11545/ 11920 | consumed samples: 11822080 | elapsed time per iteration (ms): 5647.7 | throughput per GPU (TFLOP/s/GPU): 79.9 | MFU 8.08% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802392E+00 | loss scale: 1.0 | grad norm: 0.143 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:17.870629 | finish at 2025-09-10 11:55:02 + [2025-09-10 11:19:50] iteration 11546/ 11920 | consumed samples: 11823104 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796888E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:35:03.161131 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:19:55] iteration 11547/ 11920 | consumed samples: 11824128 | elapsed time per iteration (ms): 5942.3 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.68% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790577E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:56.477664 | finish at 2025-09-10 11:56:52 + [2025-09-10 11:20:01] iteration 11548/ 11920 | consumed samples: 11825152 | elapsed time per iteration (ms): 5881.8 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807571E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:28.025851 | finish at 2025-09-10 11:56:29 + [2025-09-10 11:20:07] iteration 11549/ 11920 | consumed samples: 11826176 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794933E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:49.040160 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:20:13] iteration 11550/ 11920 | consumed samples: 11827200 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798028E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:40.868824 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:20:19] iteration 11551/ 11920 | consumed samples: 11828224 | elapsed time per iteration (ms): 6007.6 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796733E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:36:56.796332 | finish at 2025-09-10 11:57:15 + 
[2025-09-10 11:20:24] iteration 11552/ 11920 | consumed samples: 11829248 | elapsed time per iteration (ms): 5633.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804807E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:33.290878 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:20:30] iteration 11553/ 11920 | consumed samples: 11830272 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795511E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:23.463516 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:20:36] iteration 11554/ 11920 | consumed samples: 11831296 | elapsed time per iteration (ms): 6275.2 | throughput per GPU (TFLOP/s/GPU): 71.9 | MFU 7.27% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811213E+00 | loss scale: 1.0 | grad norm: 0.240 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:38:16.728919 | finish at 2025-09-10 11:58:53 + [2025-09-10 11:20:42] iteration 11555/ 11920 | consumed samples: 11832320 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797628E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:13.032582 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:20:47] iteration 11556/ 11920 | consumed samples: 11833344 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803977E+00 | loss scale: 
1.0 | grad norm: 0.220 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:09.326377 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:20:53] iteration 11557/ 11920 | consumed samples: 11834368 | elapsed time per iteration (ms): 5637.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790903E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:06.310739 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:20:59] iteration 11558/ 11920 | consumed samples: 11835392 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795696E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:59.306581 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:21:04] iteration 11559/ 11920 | consumed samples: 11836416 | elapsed time per iteration (ms): 5625.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794400E+00 | loss scale: 1.0 | grad norm: 0.224 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:50.757719 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:21:10] iteration 11560/ 11920 | consumed samples: 11837440 | elapsed time per iteration (ms): 5628.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809155E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:46.359215 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:21:16] iteration 11561/ 11920 | consumed samples: 11838464 | elapsed time per 
iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811302E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:37.171170 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:21:21] iteration 11562/ 11920 | consumed samples: 11839488 | elapsed time per iteration (ms): 5618.5 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799795E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:31.435886 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:21:27] iteration 11563/ 11920 | consumed samples: 11840512 | elapsed time per iteration (ms): 5617.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798815E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:25.481144 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:21:32] iteration 11564/ 11920 | consumed samples: 11841536 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798274E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:21.997783 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:21:38] iteration 11565/ 11920 | consumed samples: 11842560 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799177E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:33:14.343883 | finish at 2025-09-10 11:54:52 + [2025-09-10 11:21:44] iteration 11566/ 11920 | consumed samples: 11843584 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805685E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:09.481901 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:21:50] iteration 11567/ 11920 | consumed samples: 11844608 | elapsed time per iteration (ms): 5923.6 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802036E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:51.013869 | finish at 2025-09-10 11:56:41 + [2025-09-10 11:21:55] iteration 11568/ 11920 | consumed samples: 11845632 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800880E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:05.229515 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:22:01] iteration 11569/ 11920 | consumed samples: 11846656 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799090E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:59.596855 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:22:06] iteration 11570/ 11920 | consumed samples: 11847680 | elapsed time per iteration (ms): 5634.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803365E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:52.053981 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:22:12] iteration 11571/ 11920 | consumed samples: 11848704 | elapsed time per iteration (ms): 5624.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790899E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:43.094296 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:22:18] iteration 11572/ 11920 | consumed samples: 11849728 | elapsed time per iteration (ms): 5617.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786459E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:34.822652 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:22:23] iteration 11573/ 11920 | consumed samples: 11850752 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810163E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:31.037263 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:22:29] iteration 11574/ 11920 | consumed samples: 11851776 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786738E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:24.139916 | finish at 2025-09-10 11:54:53 + 
[2025-09-10 11:22:35] iteration 11575/ 11920 | consumed samples: 11852800 | elapsed time per iteration (ms): 6014.0 | throughput per GPU (TFLOP/s/GPU): 75.1 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.778399E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:34:34.828005 | finish at 2025-09-10 11:57:10 + [2025-09-10 11:22:41] iteration 11576/ 11920 | consumed samples: 11853824 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795812E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:14.624285 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:22:46] iteration 11577/ 11920 | consumed samples: 11854848 | elapsed time per iteration (ms): 5630.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807993E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:11.198967 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:22:52] iteration 11578/ 11920 | consumed samples: 11855872 | elapsed time per iteration (ms): 5869.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.814927E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:27.409091 | finish at 2025-09-10 11:56:20 + [2025-09-10 11:22:58] iteration 11579/ 11920 | consumed samples: 11856896 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809708E+00 | loss scale: 
1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:56.997847 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:23:04] iteration 11580/ 11920 | consumed samples: 11857920 | elapsed time per iteration (ms): 5839.6 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790452E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:33:05.462165 | finish at 2025-09-10 11:56:09 + [2025-09-10 11:23:09] iteration 11581/ 11920 | consumed samples: 11858944 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798248E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:48.569715 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:23:15] iteration 11582/ 11920 | consumed samples: 11859968 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800979E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:39.487998 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:23:20] iteration 11583/ 11920 | consumed samples: 11860992 | elapsed time per iteration (ms): 5632.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802389E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:38.032760 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:23:26] iteration 11584/ 11920 | consumed samples: 11862016 | elapsed time per 
iteration (ms): 5617.2 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800629E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:27.375401 | finish at 2025-09-10 11:54:53 + [2025-09-10 11:23:32] iteration 11585/ 11920 | consumed samples: 11863040 | elapsed time per iteration (ms): 5627.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.782845E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:25.196146 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:23:37] iteration 11586/ 11920 | consumed samples: 11864064 | elapsed time per iteration (ms): 5619.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788663E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:16.755860 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:23:43] iteration 11587/ 11920 | consumed samples: 11865088 | elapsed time per iteration (ms): 5628.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789654E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:14.325984 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:23:49] iteration 11588/ 11920 | consumed samples: 11866112 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804214E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:31:11.204453 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:23:54] iteration 11589/ 11920 | consumed samples: 11867136 | elapsed time per iteration (ms): 5636.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792167E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:31:05.569715 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:24:00] iteration 11590/ 11920 | consumed samples: 11868160 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799731E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:55.220103 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:24:05] iteration 11591/ 11920 | consumed samples: 11869184 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793420E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:50.043370 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:24:11] iteration 11592/ 11920 | consumed samples: 11870208 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803707E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:46.269911 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:24:17] iteration 11593/ 11920 | consumed samples: 11871232 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815865E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:41.217419 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:24:23] iteration 11594/ 11920 | consumed samples: 11872256 | elapsed time per iteration (ms): 5989.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800191E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:32.667175 | finish at 2025-09-10 11:56:55 + [2025-09-10 11:24:29] iteration 11595/ 11920 | consumed samples: 11873280 | elapsed time per iteration (ms): 5975.3 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813249E+00 | loss scale: 1.0 | grad norm: 0.236 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:21.975468 | finish at 2025-09-10 11:56:51 + [2025-09-10 11:24:34] iteration 11596/ 11920 | consumed samples: 11874304 | elapsed time per iteration (ms): 5617.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802999E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:20.158702 | finish at 2025-09-10 11:54:54 + [2025-09-10 11:24:40] iteration 11597/ 11920 | consumed samples: 11875328 | elapsed time per iteration (ms): 6006.9 | throughput per GPU (TFLOP/s/GPU): 75.2 | MFU 7.60% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795626E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:32:20.218789 | finish at 2025-09-10 11:57:01 + 
[2025-09-10 11:24:46] iteration 11598/ 11920 | consumed samples: 11876352 | elapsed time per iteration (ms): 5618.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792116E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:09.236379 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:24:52] iteration 11599/ 11920 | consumed samples: 11877376 | elapsed time per iteration (ms): 5635.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798339E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:08.940918 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:24:57] iteration 11600/ 11920 | consumed samples: 11878400 | elapsed time per iteration (ms): 5625.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804128E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:30:00.297623 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:25:03] iteration 11601/ 11920 | consumed samples: 11879424 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815207E+00 | loss scale: 1.0 | grad norm: 0.207 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:53.937148 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:25:08] iteration 11602/ 11920 | consumed samples: 11880448 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801545E+00 | loss scale: 
1.0 | grad norm: 0.215 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:50.926406 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:25:14] iteration 11603/ 11920 | consumed samples: 11881472 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805009E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:42.199085 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:25:20] iteration 11604/ 11920 | consumed samples: 11882496 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800717E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:36.316480 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:25:25] iteration 11605/ 11920 | consumed samples: 11883520 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796237E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:34.036657 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:25:31] iteration 11606/ 11920 | consumed samples: 11884544 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790509E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:25.017149 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:25:37] iteration 11607/ 11920 | consumed samples: 11885568 | elapsed time per 
iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803192E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:19.373165 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:25:42] iteration 11608/ 11920 | consumed samples: 11886592 | elapsed time per iteration (ms): 5616.7 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806479E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:12.416182 | finish at 2025-09-10 11:54:55 + [2025-09-10 11:25:48] iteration 11609/ 11920 | consumed samples: 11887616 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811921E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:08.402398 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:25:53] iteration 11610/ 11920 | consumed samples: 11888640 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802123E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:05.193605 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:25:59] iteration 11611/ 11920 | consumed samples: 11889664 | elapsed time per iteration (ms): 5623.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796833E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:28:57.704705 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:26:05] iteration 11612/ 11920 | consumed samples: 11890688 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790480E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:51.656696 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:26:10] iteration 11613/ 11920 | consumed samples: 11891712 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795632E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:47.190029 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:26:16] iteration 11614/ 11920 | consumed samples: 11892736 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787593E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:40.618565 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:26:22] iteration 11615/ 11920 | consumed samples: 11893760 | elapsed time per iteration (ms): 5828.6 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800385E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:37.724433 | finish at 2025-09-10 11:55:59 + [2025-09-10 11:26:27] iteration 11616/ 11920 | consumed samples: 11894784 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791485E+00 | loss scale: 1.0 | grad norm: 0.135 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:28.885849 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:26:33] iteration 11617/ 11920 | consumed samples: 11895808 | elapsed time per iteration (ms): 5625.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800846E+00 | loss scale: 1.0 | grad norm: 0.131 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:24.518543 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:26:39] iteration 11618/ 11920 | consumed samples: 11896832 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794661E+00 | loss scale: 1.0 | grad norm: 0.130 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:16.975285 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:26:44] iteration 11619/ 11920 | consumed samples: 11897856 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803931E+00 | loss scale: 1.0 | grad norm: 0.122 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:11.710603 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:26:50] iteration 11620/ 11920 | consumed samples: 11898880 | elapsed time per iteration (ms): 5887.8 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.75% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808006E+00 | loss scale: 1.0 | grad norm: 0.133 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:26.346002 | finish at 2025-09-10 11:56:16 + 
[2025-09-10 11:26:56] iteration 11621/ 11920 | consumed samples: 11899904 | elapsed time per iteration (ms): 5987.5 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788639E+00 | loss scale: 1.0 | grad norm: 0.148 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:50.268403 | finish at 2025-09-10 11:56:46 + [2025-09-10 11:27:02] iteration 11622/ 11920 | consumed samples: 11900928 | elapsed time per iteration (ms): 5623.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788564E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:55.742925 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:27:07] iteration 11623/ 11920 | consumed samples: 11901952 | elapsed time per iteration (ms): 5642.3 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798208E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:55.749401 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:27:13] iteration 11624/ 11920 | consumed samples: 11902976 | elapsed time per iteration (ms): 5657.7 | throughput per GPU (TFLOP/s/GPU): 79.8 | MFU 8.07% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793969E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:54.682253 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:27:19] iteration 11625/ 11920 | consumed samples: 11904000 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800692E+00 | loss scale: 
1.0 | grad norm: 0.230 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:38.427116 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:27:24] iteration 11626/ 11920 | consumed samples: 11905024 | elapsed time per iteration (ms): 5620.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805073E+00 | loss scale: 1.0 | grad norm: 0.250 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:32.414198 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:27:30] iteration 11627/ 11920 | consumed samples: 11906048 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800093E+00 | loss scale: 1.0 | grad norm: 0.223 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:27.818748 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:27:36] iteration 11628/ 11920 | consumed samples: 11907072 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801078E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:22.795181 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:27:42] iteration 11629/ 11920 | consumed samples: 11908096 | elapsed time per iteration (ms): 6179.3 | throughput per GPU (TFLOP/s/GPU): 73.1 | MFU 7.39% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803023E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:29:58.166885 | finish at 2025-09-10 11:57:40 + [2025-09-10 11:27:47] iteration 11630/ 11920 | consumed samples: 11909120 | elapsed time per 
iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792349E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:12.012146 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:27:53] iteration 11631/ 11920 | consumed samples: 11910144 | elapsed time per iteration (ms): 5631.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790355E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:07.549322 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:27:59] iteration 11632/ 11920 | consumed samples: 11911168 | elapsed time per iteration (ms): 5628.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795204E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:27:01.073845 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:28:04] iteration 11633/ 11920 | consumed samples: 11912192 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804653E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:52.195012 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:28:10] iteration 11634/ 11920 | consumed samples: 11913216 | elapsed time per iteration (ms): 5621.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806778E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:26:47.623266 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:28:16] iteration 11635/ 11920 | consumed samples: 11914240 | elapsed time per iteration (ms): 5937.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800992E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:12.187160 | finish at 2025-09-10 11:56:28 + [2025-09-10 11:28:21] iteration 11636/ 11920 | consumed samples: 11915264 | elapsed time per iteration (ms): 5619.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792127E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:36.050716 | finish at 2025-09-10 11:54:57 + [2025-09-10 11:28:27] iteration 11637/ 11920 | consumed samples: 11916288 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.773079E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:30.797330 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:28:33] iteration 11638/ 11920 | consumed samples: 11917312 | elapsed time per iteration (ms): 6061.7 | throughput per GPU (TFLOP/s/GPU): 74.5 | MFU 7.53% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799610E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:28:29.397682 | finish at 2025-09-10 11:57:02 + [2025-09-10 11:28:39] iteration 11639/ 11920 | consumed samples: 11918336 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805249E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:19.027422 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:28:44] iteration 11640/ 11920 | consumed samples: 11919360 | elapsed time per iteration (ms): 5620.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798589E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:13.700371 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:28:50] iteration 11641/ 11920 | consumed samples: 11920384 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799672E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:07.649902 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:28:56] iteration 11642/ 11920 | consumed samples: 11921408 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786574E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:04.894331 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:29:01] iteration 11643/ 11920 | consumed samples: 11922432 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794265E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:58.590728 | finish at 2025-09-10 11:55:00 + 
[2025-09-10 11:29:07] iteration 11644/ 11920 | consumed samples: 11923456 | elapsed time per iteration (ms): 5829.1 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804454E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:48.823148 | finish at 2025-09-10 11:55:56 + [2025-09-10 11:29:13] iteration 11645/ 11920 | consumed samples: 11924480 | elapsed time per iteration (ms): 5638.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796996E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:50.590706 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:29:18] iteration 11646/ 11920 | consumed samples: 11925504 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788177E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:39.445677 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:29:24] iteration 11647/ 11920 | consumed samples: 11926528 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803284E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:34.841598 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:29:30] iteration 11648/ 11920 | consumed samples: 11927552 | elapsed time per iteration (ms): 5613.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800447E+00 | loss scale: 
1.0 | grad norm: 0.201 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:26.809193 | finish at 2025-09-10 11:54:56 + [2025-09-10 11:29:35] iteration 11649/ 11920 | consumed samples: 11928576 | elapsed time per iteration (ms): 5933.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789669E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:47.841021 | finish at 2025-09-10 11:56:23 + [2025-09-10 11:29:41] iteration 11650/ 11920 | consumed samples: 11929600 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791941E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:17.279077 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:29:47] iteration 11651/ 11920 | consumed samples: 11930624 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796144E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:11.721478 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:29:53] iteration 11652/ 11920 | consumed samples: 11931648 | elapsed time per iteration (ms): 5921.4 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.777072E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:26.938925 | finish at 2025-09-10 11:56:20 + [2025-09-10 11:29:59] iteration 11653/ 11920 | consumed samples: 11932672 | elapsed time per 
iteration (ms): 5932.1 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794300E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:23.859703 | finish at 2025-09-10 11:56:22 + [2025-09-10 11:30:04] iteration 11654/ 11920 | consumed samples: 11933696 | elapsed time per iteration (ms): 5843.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795774E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:54.422216 | finish at 2025-09-10 11:55:59 + [2025-09-10 11:30:10] iteration 11655/ 11920 | consumed samples: 11934720 | elapsed time per iteration (ms): 5904.7 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797350E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:26:04.739509 | finish at 2025-09-10 11:56:15 + [2025-09-10 11:30:16] iteration 11656/ 11920 | consumed samples: 11935744 | elapsed time per iteration (ms): 5625.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793641E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:45.077042 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:30:22] iteration 11657/ 11920 | consumed samples: 11936768 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789782E+00 | loss scale: 1.0 | grad norm: 0.179 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:24:38.381015 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:30:27] iteration 11658/ 11920 | consumed samples: 11937792 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790253E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:32.139198 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:30:33] iteration 11659/ 11920 | consumed samples: 11938816 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808331E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:29.923492 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:30:38] iteration 11660/ 11920 | consumed samples: 11939840 | elapsed time per iteration (ms): 5615.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795357E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:20.015488 | finish at 2025-09-10 11:54:58 + [2025-09-10 11:30:44] iteration 11661/ 11920 | consumed samples: 11940864 | elapsed time per iteration (ms): 5616.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791365E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:14.635436 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:30:50] iteration 11662/ 11920 | consumed samples: 11941888 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805295E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:08.873730 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:30:56] iteration 11663/ 11920 | consumed samples: 11942912 | elapsed time per iteration (ms): 5852.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800438E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:04.069819 | finish at 2025-09-10 11:56:00 + [2025-09-10 11:31:01] iteration 11664/ 11920 | consumed samples: 11943936 | elapsed time per iteration (ms): 5861.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808158E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:25:00.514038 | finish at 2025-09-10 11:56:02 + [2025-09-10 11:31:07] iteration 11665/ 11920 | consumed samples: 11944960 | elapsed time per iteration (ms): 5619.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794075E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:52.968042 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:31:13] iteration 11666/ 11920 | consumed samples: 11945984 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791901E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:48.706580 | finish at 2025-09-10 11:55:01 + 
[2025-09-10 11:31:18] iteration 11667/ 11920 | consumed samples: 11947008 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806175E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:40.919523 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:31:24] iteration 11668/ 11920 | consumed samples: 11948032 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798712E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:37.044823 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:31:30] iteration 11669/ 11920 | consumed samples: 11949056 | elapsed time per iteration (ms): 5911.2 | throughput per GPU (TFLOP/s/GPU): 76.4 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805301E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:43.715939 | finish at 2025-09-10 11:56:13 + [2025-09-10 11:31:35] iteration 11670/ 11920 | consumed samples: 11950080 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787472E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:24.709935 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:31:41] iteration 11671/ 11920 | consumed samples: 11951104 | elapsed time per iteration (ms): 5620.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784521E+00 | loss scale: 
1.0 | grad norm: 0.183 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:19.565669 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:31:47] iteration 11672/ 11920 | consumed samples: 11952128 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799285E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:14.182085 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:31:52] iteration 11673/ 11920 | consumed samples: 11953152 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796936E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:09.531822 | finish at 2025-09-10 11:55:02 + [2025-09-10 11:31:58] iteration 11674/ 11920 | consumed samples: 11954176 | elapsed time per iteration (ms): 5939.1 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800091E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:24:21.017903 | finish at 2025-09-10 11:56:19 + [2025-09-10 11:32:04] iteration 11675/ 11920 | consumed samples: 11955200 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801757E+00 | loss scale: 1.0 | grad norm: 0.227 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:56.977077 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:32:09] iteration 11676/ 11920 | consumed samples: 11956224 | elapsed time per 
iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790850E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:51.702258 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:32:15] iteration 11677/ 11920 | consumed samples: 11957248 | elapsed time per iteration (ms): 5860.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799507E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:43.973312 | finish at 2025-09-10 11:55:59 + [2025-09-10 11:32:21] iteration 11678/ 11920 | consumed samples: 11958272 | elapsed time per iteration (ms): 5613.6 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809702E+00 | loss scale: 1.0 | grad norm: 0.234 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:38.487165 | finish at 2025-09-10 11:54:59 + [2025-09-10 11:32:27] iteration 11679/ 11920 | consumed samples: 11959296 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789479E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:35.133669 | finish at 2025-09-10 11:55:02 + [2025-09-10 11:32:32] iteration 11680/ 11920 | consumed samples: 11960320 | elapsed time per iteration (ms): 5947.1 | throughput per GPU (TFLOP/s/GPU): 75.9 | MFU 7.68% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786580E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:23:47.308102 | finish at 2025-09-10 11:56:20 + [2025-09-10 11:32:38] iteration 11681/ 11920 | consumed samples: 11961344 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787710E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:24.420130 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:32:44] iteration 11682/ 11920 | consumed samples: 11962368 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798271E+00 | loss scale: 1.0 | grad norm: 0.198 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:18.381110 | finish at 2025-09-10 11:55:02 + [2025-09-10 11:32:49] iteration 11683/ 11920 | consumed samples: 11963392 | elapsed time per iteration (ms): 5631.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797209E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:14.755797 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:32:55] iteration 11684/ 11920 | consumed samples: 11964416 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799486E+00 | loss scale: 1.0 | grad norm: 0.221 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:08.438024 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:33:01] iteration 11685/ 11920 | consumed samples: 11965440 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797353E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:00.531104 | finish at 2025-09-10 11:55:01 + [2025-09-10 11:33:06] iteration 11686/ 11920 | consumed samples: 11966464 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800107E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:54.086020 | finish at 2025-09-10 11:55:00 + [2025-09-10 11:33:12] iteration 11687/ 11920 | consumed samples: 11967488 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798917E+00 | loss scale: 1.0 | grad norm: 0.192 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:52.471922 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:33:18] iteration 11688/ 11920 | consumed samples: 11968512 | elapsed time per iteration (ms): 5984.9 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798208E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:23:08.507198 | finish at 2025-09-10 11:56:26 + [2025-09-10 11:33:24] iteration 11689/ 11920 | consumed samples: 11969536 | elapsed time per iteration (ms): 5860.0 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795625E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:33.662234 | finish at 2025-09-10 11:55:57 + 
[2025-09-10 11:33:30] iteration 11690/ 11920 | consumed samples: 11970560 | elapsed time per iteration (ms): 5990.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.778235E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:57.773373 | finish at 2025-09-10 11:56:27 + [2025-09-10 11:33:35] iteration 11691/ 11920 | consumed samples: 11971584 | elapsed time per iteration (ms): 5619.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798296E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:26.939025 | finish at 2025-09-10 11:55:02 + [2025-09-10 11:33:41] iteration 11692/ 11920 | consumed samples: 11972608 | elapsed time per iteration (ms): 5825.7 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.84% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802013E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:08.263361 | finish at 2025-09-10 11:55:49 + [2025-09-10 11:33:47] iteration 11693/ 11920 | consumed samples: 11973632 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783604E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:16.634540 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:33:52] iteration 11694/ 11920 | consumed samples: 11974656 | elapsed time per iteration (ms): 5630.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798016E+00 | loss scale: 
1.0 | grad norm: 0.167 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:12.373398 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:33:58] iteration 11695/ 11920 | consumed samples: 11975680 | elapsed time per iteration (ms): 5629.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787514E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:06.526651 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:34:04] iteration 11696/ 11920 | consumed samples: 11976704 | elapsed time per iteration (ms): 5901.3 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792061E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:22:01.893646 | finish at 2025-09-10 11:56:06 + [2025-09-10 11:34:10] iteration 11697/ 11920 | consumed samples: 11977728 | elapsed time per iteration (ms): 5633.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.810472E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:56.239100 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:34:15] iteration 11698/ 11920 | consumed samples: 11978752 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794546E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:50.204751 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:34:21] iteration 11699/ 11920 | consumed samples: 11979776 | elapsed time per 
iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794291E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:43.170893 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:34:26] iteration 11700/ 11920 | consumed samples: 11980800 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799694E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:37.543430 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:34:32] iteration 11701/ 11920 | consumed samples: 11981824 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795619E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:33.017487 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:34:38] iteration 11702/ 11920 | consumed samples: 11982848 | elapsed time per iteration (ms): 5622.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787673E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:25.707482 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:34:43] iteration 11703/ 11920 | consumed samples: 11983872 | elapsed time per iteration (ms): 5618.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.780673E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:20:19.289672 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:34:49] iteration 11704/ 11920 | consumed samples: 11984896 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791286E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:14.054695 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:34:55] iteration 11705/ 11920 | consumed samples: 11985920 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785189E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:11.250193 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:35:01] iteration 11706/ 11920 | consumed samples: 11986944 | elapsed time per iteration (ms): 6085.1 | throughput per GPU (TFLOP/s/GPU): 74.2 | MFU 7.50% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795859E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:42.206947 | finish at 2025-09-10 11:56:43 + [2025-09-10 11:35:06] iteration 11707/ 11920 | consumed samples: 11987968 | elapsed time per iteration (ms): 5624.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804806E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:58.080768 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:35:12] iteration 11708/ 11920 | consumed samples: 11988992 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799938E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:52.240048 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:35:18] iteration 11709/ 11920 | consumed samples: 11990016 | elapsed time per iteration (ms): 5833.8 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800567E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:20:30.927339 | finish at 2025-09-10 11:55:49 + [2025-09-10 11:35:23] iteration 11710/ 11920 | consumed samples: 11991040 | elapsed time per iteration (ms): 5615.8 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786898E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:39.313924 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:35:29] iteration 11711/ 11920 | consumed samples: 11992064 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783492E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:35.129346 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:35:35] iteration 11712/ 11920 | consumed samples: 11993088 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798573E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:28.520847 | finish at 2025-09-10 11:55:03 + 
[2025-09-10 11:35:40] iteration 11713/ 11920 | consumed samples: 11994112 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794803E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:23.103577 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:35:46] iteration 11714/ 11920 | consumed samples: 11995136 | elapsed time per iteration (ms): 5619.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809971E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:17.646845 | finish at 2025-09-10 11:55:03 + [2025-09-10 11:35:51] iteration 11715/ 11920 | consumed samples: 11996160 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799531E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:14.651538 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:35:58] iteration 11716/ 11920 | consumed samples: 11997184 | elapsed time per iteration (ms): 6341.9 | throughput per GPU (TFLOP/s/GPU): 71.2 | MFU 7.20% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791938E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:21:33.756815 | finish at 2025-09-10 11:57:32 + [2025-09-10 11:36:03] iteration 11717/ 11920 | consumed samples: 11998208 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794920E+00 | loss scale: 
1.0 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:01.925432 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:36:09] iteration 11718/ 11920 | consumed samples: 11999232 | elapsed time per iteration (ms): 5622.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802115E+00 | loss scale: 1.0 | grad norm: 0.253 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:55.673711 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:36:15] iteration 11719/ 11920 | consumed samples: 12000256 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791837E+00 | loss scale: 1.0 | grad norm: 0.248 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:52.293840 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:36:20] iteration 11720/ 11920 | consumed samples: 12001280 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799999E+00 | loss scale: 1.0 | grad norm: 0.204 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:45.235033 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:36:26] iteration 11721/ 11920 | consumed samples: 12002304 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790101E+00 | loss scale: 1.0 | grad norm: 0.210 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:38.443506 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:36:32] iteration 11722/ 11920 | consumed samples: 12003328 | elapsed time per 
iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788152E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:33.015555 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:36:37] iteration 11723/ 11920 | consumed samples: 12004352 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799994E+00 | loss scale: 1.0 | grad norm: 0.219 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:27.249273 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:36:43] iteration 11724/ 11920 | consumed samples: 12005376 | elapsed time per iteration (ms): 5833.2 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.807186E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:03.315187 | finish at 2025-09-10 11:55:46 + [2025-09-10 11:36:49] iteration 11725/ 11920 | consumed samples: 12006400 | elapsed time per iteration (ms): 5620.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791219E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:15.943122 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:36:54] iteration 11726/ 11920 | consumed samples: 12007424 | elapsed time per iteration (ms): 5626.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796114E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:18:11.511516 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:37:00] iteration 11727/ 11920 | consumed samples: 12008448 | elapsed time per iteration (ms): 6020.0 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.58% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793621E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:19:21.869383 | finish at 2025-09-10 11:56:22 + [2025-09-10 11:37:06] iteration 11728/ 11920 | consumed samples: 12009472 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798018E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:59.205734 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:37:12] iteration 11729/ 11920 | consumed samples: 12010496 | elapsed time per iteration (ms): 5627.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789189E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:54.765124 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:37:17] iteration 11730/ 11920 | consumed samples: 12011520 | elapsed time per iteration (ms): 5633.8 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789287E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:50.425131 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:37:23] iteration 11731/ 11920 | consumed samples: 12012544 | elapsed time per iteration (ms): 5616.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801760E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 11.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:41.593598 | finish at 2025-09-10 11:55:04 + [2025-09-10 11:37:28] iteration 11732/ 11920 | consumed samples: 12013568 | elapsed time per iteration (ms): 5626.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797345E+00 | loss scale: 1.0 | grad norm: 0.136 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:37.719004 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:37:34] iteration 11733/ 11920 | consumed samples: 12014592 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805880E+00 | loss scale: 1.0 | grad norm: 0.134 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:31.423317 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:37:40] iteration 11734/ 11920 | consumed samples: 12015616 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792768E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:25.116032 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:37:45] iteration 11735/ 11920 | consumed samples: 12016640 | elapsed time per iteration (ms): 5618.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787364E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:19.356073 | finish at 2025-09-10 11:55:05 + 
[2025-09-10 11:37:51] iteration 11736/ 11920 | consumed samples: 12017664 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785800E+00 | loss scale: 1.0 | grad norm: 0.138 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:14.991928 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:37:57] iteration 11737/ 11920 | consumed samples: 12018688 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791504E+00 | loss scale: 1.0 | grad norm: 0.132 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:11.485761 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:38:03] iteration 11738/ 11920 | consumed samples: 12019712 | elapsed time per iteration (ms): 5991.8 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.62% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792871E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:18:10.504228 | finish at 2025-09-10 11:56:13 + [2025-09-10 11:38:08] iteration 11739/ 11920 | consumed samples: 12020736 | elapsed time per iteration (ms): 5638.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792345E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:00.600861 | finish at 2025-09-10 11:55:09 + [2025-09-10 11:38:14] iteration 11740/ 11920 | consumed samples: 12021760 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.780847E+00 | loss scale: 
1.0 | grad norm: 0.194 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:53.975644 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:38:20] iteration 11741/ 11920 | consumed samples: 12022784 | elapsed time per iteration (ms): 5939.5 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799503E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:43.164115 | finish at 2025-09-10 11:56:03 + [2025-09-10 11:38:25] iteration 11742/ 11920 | consumed samples: 12023808 | elapsed time per iteration (ms): 5620.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803902E+00 | loss scale: 1.0 | grad norm: 0.212 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:40.464803 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:38:31] iteration 11743/ 11920 | consumed samples: 12024832 | elapsed time per iteration (ms): 5623.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785635E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:35.412607 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:38:37] iteration 11744/ 11920 | consumed samples: 12025856 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805460E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:29.584915 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:38:43] iteration 11745/ 11920 | consumed samples: 12026880 | elapsed time per 
iteration (ms): 5928.9 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798408E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:17:17.551951 | finish at 2025-09-10 11:56:00 + [2025-09-10 11:38:48] iteration 11746/ 11920 | consumed samples: 12027904 | elapsed time per iteration (ms): 5621.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789891E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:18.047828 | finish at 2025-09-10 11:55:06 + [2025-09-10 11:38:54] iteration 11747/ 11920 | consumed samples: 12028928 | elapsed time per iteration (ms): 5631.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788495E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:14.309473 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:38:59] iteration 11748/ 11920 | consumed samples: 12029952 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787291E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:07.437914 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:39:05] iteration 11749/ 11920 | consumed samples: 12030976 | elapsed time per iteration (ms): 5616.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790211E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:16:00.390743 | finish at 2025-09-10 11:55:05 + [2025-09-10 11:39:11] iteration 11750/ 11920 | consumed samples: 12032000 | elapsed time per iteration (ms): 5634.0 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794981E+00 | loss scale: 1.0 | grad norm: 0.205 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:57.779403 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:39:16] iteration 11751/ 11920 | consumed samples: 12033024 | elapsed time per iteration (ms): 5624.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805016E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:50.477891 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:39:22] iteration 11752/ 11920 | consumed samples: 12034048 | elapsed time per iteration (ms): 5920.8 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794553E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:34.697239 | finish at 2025-09-10 11:55:57 + [2025-09-10 11:39:28] iteration 11753/ 11920 | consumed samples: 12035072 | elapsed time per iteration (ms): 5622.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794095E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:38.909871 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:39:33] iteration 11754/ 11920 | consumed samples: 12036096 | elapsed time per iteration (ms): 5629.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790816E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:34.501101 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:39:39] iteration 11755/ 11920 | consumed samples: 12037120 | elapsed time per iteration (ms): 5629.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787798E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:28.854268 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:39:45] iteration 11756/ 11920 | consumed samples: 12038144 | elapsed time per iteration (ms): 5629.1 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801193E+00 | loss scale: 1.0 | grad norm: 0.169 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:23.173900 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:39:51] iteration 11757/ 11920 | consumed samples: 12039168 | elapsed time per iteration (ms): 5970.4 | throughput per GPU (TFLOP/s/GPU): 75.6 | MFU 7.65% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.778081E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:16:13.170630 | finish at 2025-09-10 11:56:04 + [2025-09-10 11:39:56] iteration 11758/ 11920 | consumed samples: 12040192 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.782018E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:12.755788 | finish at 2025-09-10 11:55:09 + 
[2025-09-10 11:40:02] iteration 11759/ 11920 | consumed samples: 12041216 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797657E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:05.453072 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:40:08] iteration 11760/ 11920 | consumed samples: 12042240 | elapsed time per iteration (ms): 5914.0 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.72% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799152E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:46.242332 | finish at 2025-09-10 11:55:54 + [2025-09-10 11:40:13] iteration 11761/ 11920 | consumed samples: 12043264 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808632E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:54.043186 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:40:19] iteration 11762/ 11920 | consumed samples: 12044288 | elapsed time per iteration (ms): 5617.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783753E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:47.552918 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:40:25] iteration 11763/ 11920 | consumed samples: 12045312 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805087E+00 | loss scale: 
1.0 | grad norm: 0.198 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:42.227462 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:40:30] iteration 11764/ 11920 | consumed samples: 12046336 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790192E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:37.508964 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:40:36] iteration 11765/ 11920 | consumed samples: 12047360 | elapsed time per iteration (ms): 5837.7 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802174E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:15:04.840305 | finish at 2025-09-10 11:55:41 + [2025-09-10 11:40:42] iteration 11766/ 11920 | consumed samples: 12048384 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798865E+00 | loss scale: 1.0 | grad norm: 0.217 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:25.665547 | finish at 2025-09-10 11:55:07 + [2025-09-10 11:40:48] iteration 11767/ 11920 | consumed samples: 12049408 | elapsed time per iteration (ms): 5868.6 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800682E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:57.893923 | finish at 2025-09-10 11:55:46 + [2025-09-10 11:40:53] iteration 11768/ 11920 | consumed samples: 12050432 | elapsed time per 
iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794245E+00 | loss scale: 1.0 | grad norm: 0.242 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:15.001377 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:40:59] iteration 11769/ 11920 | consumed samples: 12051456 | elapsed time per iteration (ms): 5899.9 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796172E+00 | loss scale: 1.0 | grad norm: 0.257 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:50.878090 | finish at 2025-09-10 11:55:50 + [2025-09-10 11:41:05] iteration 11770/ 11920 | consumed samples: 12052480 | elapsed time per iteration (ms): 5626.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800435E+00 | loss scale: 1.0 | grad norm: 0.261 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:03.898058 | finish at 2025-09-10 11:55:09 + [2025-09-10 11:41:10] iteration 11771/ 11920 | consumed samples: 12053504 | elapsed time per iteration (ms): 5618.3 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804896E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:57.120335 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:41:16] iteration 11772/ 11920 | consumed samples: 12054528 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809858E+00 | loss scale: 1.0 | grad norm: 0.267 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:13:53.730597 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:41:22] iteration 11773/ 11920 | consumed samples: 12055552 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.818583E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:46.723665 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:41:27] iteration 11774/ 11920 | consumed samples: 12056576 | elapsed time per iteration (ms): 5622.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801400E+00 | loss scale: 1.0 | grad norm: 0.256 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:40.916703 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:41:33] iteration 11775/ 11920 | consumed samples: 12057600 | elapsed time per iteration (ms): 5620.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798319E+00 | loss scale: 1.0 | grad norm: 0.277 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:34.913570 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:41:39] iteration 11776/ 11920 | consumed samples: 12058624 | elapsed time per iteration (ms): 5619.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811587E+00 | loss scale: 1.0 | grad norm: 0.276 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:29.186359 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:41:44] iteration 11777/ 11920 | consumed samples: 12059648 | elapsed time per iteration (ms): 5621.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805669E+00 | loss scale: 1.0 | grad norm: 0.265 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:23.912653 | finish at 2025-09-10 11:55:08 + [2025-09-10 11:41:50] iteration 11778/ 11920 | consumed samples: 12060672 | elapsed time per iteration (ms): 5625.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.813148E+00 | loss scale: 1.0 | grad norm: 0.209 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:18.838735 | finish at 2025-09-10 11:55:09 + [2025-09-10 11:41:56] iteration 11779/ 11920 | consumed samples: 12061696 | elapsed time per iteration (ms): 6228.9 | throughput per GPU (TFLOP/s/GPU): 72.5 | MFU 7.33% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797487E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:14:38.274692 | finish at 2025-09-10 11:56:34 + [2025-09-10 11:42:02] iteration 11780/ 11920 | consumed samples: 12062720 | elapsed time per iteration (ms): 5633.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804700E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:08.719587 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:42:08] iteration 11781/ 11920 | consumed samples: 12063744 | elapsed time per iteration (ms): 5861.4 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801307E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:34.737932 | finish at 2025-09-10 11:55:42 + 
[2025-09-10 11:42:13] iteration 11782/ 11920 | consumed samples: 12064768 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800872E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:57.074288 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:42:19] iteration 11783/ 11920 | consumed samples: 12065792 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786283E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:50.113329 | finish at 2025-09-10 11:55:09 + [2025-09-10 11:42:24] iteration 11784/ 11920 | consumed samples: 12066816 | elapsed time per iteration (ms): 5625.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799802E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:45.115173 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:42:30] iteration 11785/ 11920 | consumed samples: 12067840 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805082E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:39.328362 | finish at 2025-09-10 11:55:09 + [2025-09-10 11:42:36] iteration 11786/ 11920 | consumed samples: 12068864 | elapsed time per iteration (ms): 5927.3 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793495E+00 | loss scale: 
1.0 | grad norm: 0.162 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:14.261040 | finish at 2025-09-10 11:55:50 + [2025-09-10 11:42:42] iteration 11787/ 11920 | consumed samples: 12069888 | elapsed time per iteration (ms): 5936.2 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796036E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:13:09.509831 | finish at 2025-09-10 11:55:51 + [2025-09-10 11:42:48] iteration 11788/ 11920 | consumed samples: 12070912 | elapsed time per iteration (ms): 5636.6 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.782397E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:24.030069 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:42:53] iteration 11789/ 11920 | consumed samples: 12071936 | elapsed time per iteration (ms): 5902.5 | throughput per GPU (TFLOP/s/GPU): 76.5 | MFU 7.73% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796040E+00 | loss scale: 1.0 | grad norm: 0.168 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:53.222773 | finish at 2025-09-10 11:55:47 + [2025-09-10 11:42:59] iteration 11790/ 11920 | consumed samples: 12072960 | elapsed time per iteration (ms): 5623.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.815294E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:11.020517 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:43:05] iteration 11791/ 11920 | consumed samples: 12073984 | elapsed time per 
iteration (ms): 5626.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785931E+00 | loss scale: 1.0 | grad norm: 0.170 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:05.826698 | finish at 2025-09-10 11:55:11 + [2025-09-10 11:43:11] iteration 11792/ 11920 | consumed samples: 12075008 | elapsed time per iteration (ms): 5873.4 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.77% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790491E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:31.797028 | finish at 2025-09-10 11:55:42 + [2025-09-10 11:43:16] iteration 11793/ 11920 | consumed samples: 12076032 | elapsed time per iteration (ms): 5630.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786907E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:55.110935 | finish at 2025-09-10 11:55:11 + [2025-09-10 11:43:22] iteration 11794/ 11920 | consumed samples: 12077056 | elapsed time per iteration (ms): 5615.1 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792010E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:47.502588 | finish at 2025-09-10 11:55:09 + [2025-09-10 11:43:27] iteration 11795/ 11920 | consumed samples: 12078080 | elapsed time per iteration (ms): 5624.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790169E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:11:43.062683 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:43:33] iteration 11796/ 11920 | consumed samples: 12079104 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785305E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:37.360813 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:43:39] iteration 11797/ 11920 | consumed samples: 12080128 | elapsed time per iteration (ms): 5936.4 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790607E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:12:10.183087 | finish at 2025-09-10 11:55:49 + [2025-09-10 11:43:45] iteration 11798/ 11920 | consumed samples: 12081152 | elapsed time per iteration (ms): 5624.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792112E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:26.170650 | finish at 2025-09-10 11:55:11 + [2025-09-10 11:43:50] iteration 11799/ 11920 | consumed samples: 12082176 | elapsed time per iteration (ms): 5620.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808053E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:20.134660 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:43:56] iteration 11800/ 11920 | consumed samples: 12083200 | elapsed time per iteration (ms): 5953.6 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792101E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:54.429245 | finish at 2025-09-10 11:55:51 + [2025-09-10 11:44:02] iteration 11801/ 11920 | consumed samples: 12084224 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789430E+00 | loss scale: 1.0 | grad norm: 0.149 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:10.242610 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:44:07] iteration 11802/ 11920 | consumed samples: 12085248 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787492E+00 | loss scale: 1.0 | grad norm: 0.147 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:03.033949 | finish at 2025-09-10 11:55:10 + [2025-09-10 11:44:13] iteration 11803/ 11920 | consumed samples: 12086272 | elapsed time per iteration (ms): 5867.5 | throughput per GPU (TFLOP/s/GPU): 76.9 | MFU 7.78% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788365E+00 | loss scale: 1.0 | grad norm: 0.151 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:26.498149 | finish at 2025-09-10 11:55:40 + [2025-09-10 11:44:19] iteration 11804/ 11920 | consumed samples: 12087296 | elapsed time per iteration (ms): 5617.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787086E+00 | loss scale: 1.0 | grad norm: 0.157 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:51.670746 | finish at 2025-09-10 11:55:11 + 
[2025-09-10 11:44:25] iteration 11805/ 11920 | consumed samples: 12088320 | elapsed time per iteration (ms): 5632.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793721E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:47.720082 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:44:30] iteration 11806/ 11920 | consumed samples: 12089344 | elapsed time per iteration (ms): 5861.9 | throughput per GPU (TFLOP/s/GPU): 77.0 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.777404E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:11:08.251772 | finish at 2025-09-10 11:55:39 + [2025-09-10 11:44:36] iteration 11807/ 11920 | consumed samples: 12090368 | elapsed time per iteration (ms): 5621.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806573E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:35.244724 | finish at 2025-09-10 11:55:11 + [2025-09-10 11:44:42] iteration 11808/ 11920 | consumed samples: 12091392 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793394E+00 | loss scale: 1.0 | grad norm: 0.235 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:29.879143 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:44:47] iteration 11809/ 11920 | consumed samples: 12092416 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.828047E+00 | loss scale: 
1.0 | grad norm: 0.240 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:25.295540 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:44:53] iteration 11810/ 11920 | consumed samples: 12093440 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802721E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:19.175518 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:44:59] iteration 11811/ 11920 | consumed samples: 12094464 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786326E+00 | loss scale: 1.0 | grad norm: 0.200 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:13.731213 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:45:04] iteration 11812/ 11920 | consumed samples: 12095488 | elapsed time per iteration (ms): 5894.6 | throughput per GPU (TFLOP/s/GPU): 76.6 | MFU 7.74% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.773929E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:36.618825 | finish at 2025-09-10 11:55:41 + [2025-09-10 11:45:10] iteration 11813/ 11920 | consumed samples: 12096512 | elapsed time per iteration (ms): 5626.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793267E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:02.075183 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:45:16] iteration 11814/ 11920 | consumed samples: 12097536 | elapsed time per 
iteration (ms): 5976.8 | throughput per GPU (TFLOP/s/GPU): 75.5 | MFU 7.64% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800287E+00 | loss scale: 1.0 | grad norm: 0.255 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:10:33.539634 | finish at 2025-09-10 11:55:50 + [2025-09-10 11:45:22] iteration 11815/ 11920 | consumed samples: 12098560 | elapsed time per iteration (ms): 5631.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796254E+00 | loss scale: 1.0 | grad norm: 0.254 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:51.313884 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:45:27] iteration 11816/ 11920 | consumed samples: 12099584 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804419E+00 | loss scale: 1.0 | grad norm: 0.226 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:44.652292 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:45:33] iteration 11817/ 11920 | consumed samples: 12100608 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.808612E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:38.939651 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:45:39] iteration 11818/ 11920 | consumed samples: 12101632 | elapsed time per iteration (ms): 5923.7 | throughput per GPU (TFLOP/s/GPU): 76.2 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809279E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:10:04.212667 | finish at 2025-09-10 11:55:43 + [2025-09-10 11:45:44] iteration 11819/ 11920 | consumed samples: 12102656 | elapsed time per iteration (ms): 5620.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800057E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:27.651172 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:45:50] iteration 11820/ 11920 | consumed samples: 12103680 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797264E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:21.908603 | finish at 2025-09-10 11:55:12 + [2025-09-10 11:45:56] iteration 11821/ 11920 | consumed samples: 12104704 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787592E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:17.016833 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:46:01] iteration 11822/ 11920 | consumed samples: 12105728 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792517E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:11.504912 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:46:07] iteration 11823/ 11920 | consumed samples: 12106752 | elapsed time per iteration (ms): 5849.9 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.80% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.781664E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:27.442197 | finish at 2025-09-10 11:55:35 + [2025-09-10 11:46:13] iteration 11824/ 11920 | consumed samples: 12107776 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.782814E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:00.874649 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:46:18] iteration 11825/ 11920 | consumed samples: 12108800 | elapsed time per iteration (ms): 5631.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794199E+00 | loss scale: 1.0 | grad norm: 0.166 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:54.945978 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:46:24] iteration 11826/ 11920 | consumed samples: 12109824 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791996E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:49.029223 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:46:30] iteration 11827/ 11920 | consumed samples: 12110848 | elapsed time per iteration (ms): 5626.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801659E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:43.278924 | finish at 2025-09-10 11:55:13 + 
[2025-09-10 11:46:35] iteration 11828/ 11920 | consumed samples: 12111872 | elapsed time per iteration (ms): 5632.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799075E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:38.184115 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:46:41] iteration 11829/ 11920 | consumed samples: 12112896 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786397E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:31.676134 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:46:47] iteration 11830/ 11920 | consumed samples: 12113920 | elapsed time per iteration (ms): 5641.1 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783760E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:27.700903 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:46:52] iteration 11831/ 11920 | consumed samples: 12114944 | elapsed time per iteration (ms): 5632.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788568E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:21.269153 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:46:58] iteration 11832/ 11920 | consumed samples: 12115968 | elapsed time per iteration (ms): 5634.9 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.809251E+00 | loss scale: 
1.0 | grad norm: 0.207 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:15.874481 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:47:04] iteration 11833/ 11920 | consumed samples: 12116992 | elapsed time per iteration (ms): 6268.7 | throughput per GPU (TFLOP/s/GPU): 72.0 | MFU 7.28% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785525E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:09:05.381121 | finish at 2025-09-10 11:56:10 + [2025-09-10 11:47:10] iteration 11834/ 11920 | consumed samples: 12118016 | elapsed time per iteration (ms): 5623.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792948E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:03.655579 | finish at 2025-09-10 11:55:13 + [2025-09-10 11:47:16] iteration 11835/ 11920 | consumed samples: 12119040 | elapsed time per iteration (ms): 5848.8 | throughput per GPU (TFLOP/s/GPU): 77.2 | MFU 7.81% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783562E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:08:17.149819 | finish at 2025-09-10 11:55:33 + [2025-09-10 11:47:21] iteration 11836/ 11920 | consumed samples: 12120064 | elapsed time per iteration (ms): 5626.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784662E+00 | loss scale: 1.0 | grad norm: 0.185 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:52.624008 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:47:27] iteration 11837/ 11920 | consumed samples: 12121088 | elapsed time per 
iteration (ms): 5635.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787726E+00 | loss scale: 1.0 | grad norm: 0.189 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:47.740837 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:47:33] iteration 11838/ 11920 | consumed samples: 12122112 | elapsed time per iteration (ms): 5852.8 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.80% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796087E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:59.926913 | finish at 2025-09-10 11:55:33 + [2025-09-10 11:47:38] iteration 11839/ 11920 | consumed samples: 12123136 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796609E+00 | loss scale: 1.0 | grad norm: 0.177 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:35.679131 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:47:44] iteration 11840/ 11920 | consumed samples: 12124160 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795130E+00 | loss scale: 1.0 | grad norm: 0.174 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:29.874878 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:47:50] iteration 11841/ 11920 | consumed samples: 12125184 | elapsed time per iteration (ms): 5624.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793934E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:07:24.350157 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:47:55] iteration 11842/ 11920 | consumed samples: 12126208 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806670E+00 | loss scale: 1.0 | grad norm: 0.225 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:19.472517 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:48:01] iteration 11843/ 11920 | consumed samples: 12127232 | elapsed time per iteration (ms): 5630.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791480E+00 | loss scale: 1.0 | grad norm: 0.232 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:13.576521 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:07] iteration 11844/ 11920 | consumed samples: 12128256 | elapsed time per iteration (ms): 5629.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.804567E+00 | loss scale: 1.0 | grad norm: 0.231 | num zeros: 9.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:07.849042 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:12] iteration 11845/ 11920 | consumed samples: 12129280 | elapsed time per iteration (ms): 5827.0 | throughput per GPU (TFLOP/s/GPU): 77.5 | MFU 7.83% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785818E+00 | loss scale: 1.0 | grad norm: 0.260 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:07:17.027067 | finish at 2025-09-10 11:55:29 + [2025-09-10 11:48:18] iteration 11846/ 11920 | consumed samples: 12130304 | elapsed time per iteration (ms): 5638.4 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.802087E+00 | loss scale: 1.0 | grad norm: 0.290 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:57.243546 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:48:24] iteration 11847/ 11920 | consumed samples: 12131328 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794701E+00 | loss scale: 1.0 | grad norm: 0.321 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:50.512340 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:29] iteration 11848/ 11920 | consumed samples: 12132352 | elapsed time per iteration (ms): 5621.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791749E+00 | loss scale: 1.0 | grad norm: 0.259 | num zeros: 10.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:44.728552 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:35] iteration 11849/ 11920 | consumed samples: 12133376 | elapsed time per iteration (ms): 5627.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792763E+00 | loss scale: 1.0 | grad norm: 0.193 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:39.574578 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:40] iteration 11850/ 11920 | consumed samples: 12134400 | elapsed time per iteration (ms): 5624.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800626E+00 | loss scale: 1.0 | grad norm: 0.175 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:33.682392 | finish at 2025-09-10 11:55:14 + 
[2025-09-10 11:48:46] iteration 11851/ 11920 | consumed samples: 12135424 | elapsed time per iteration (ms): 5620.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791624E+00 | loss scale: 1.0 | grad norm: 0.160 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:27.776801 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:52] iteration 11852/ 11920 | consumed samples: 12136448 | elapsed time per iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.780371E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:22.142499 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:48:57] iteration 11853/ 11920 | consumed samples: 12137472 | elapsed time per iteration (ms): 5629.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.777768E+00 | loss scale: 1.0 | grad norm: 0.161 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:17.189369 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:49:03] iteration 11854/ 11920 | consumed samples: 12138496 | elapsed time per iteration (ms): 5628.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800051E+00 | loss scale: 1.0 | grad norm: 0.153 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:11.508316 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:49:09] iteration 11855/ 11920 | consumed samples: 12139520 | elapsed time per iteration (ms): 5622.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793236E+00 | loss scale: 
1.0 | grad norm: 0.150 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:05.482813 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:49:14] iteration 11856/ 11920 | consumed samples: 12140544 | elapsed time per iteration (ms): 5859.0 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788793E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:06:14.975403 | finish at 2025-09-10 11:55:29 + [2025-09-10 11:49:20] iteration 11857/ 11920 | consumed samples: 12141568 | elapsed time per iteration (ms): 5618.9 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.811867E+00 | loss scale: 1.0 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:53.989502 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:49:26] iteration 11858/ 11920 | consumed samples: 12142592 | elapsed time per iteration (ms): 5619.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803581E+00 | loss scale: 1.0 | grad norm: 0.172 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:48.388315 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:49:31] iteration 11859/ 11920 | consumed samples: 12143616 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797341E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:43.026729 | finish at 2025-09-10 11:55:14 + [2025-09-10 11:49:37] iteration 11860/ 11920 | consumed samples: 12144640 | elapsed time per 
iteration (ms): 5986.2 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785772E+00 | loss scale: 1.0 | grad norm: 0.187 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:59.174266 | finish at 2025-09-10 11:55:36 + [2025-09-10 11:49:43] iteration 11861/ 11920 | consumed samples: 12145664 | elapsed time per iteration (ms): 5627.7 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789138E+00 | loss scale: 1.0 | grad norm: 0.201 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:32.035346 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:49:49] iteration 11862/ 11920 | consumed samples: 12146688 | elapsed time per iteration (ms): 5622.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.801692E+00 | loss scale: 1.0 | grad norm: 0.238 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:26.094750 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:49:54] iteration 11863/ 11920 | consumed samples: 12147712 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805143E+00 | loss scale: 1.0 | grad norm: 0.229 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:20.532467 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:00] iteration 11864/ 11920 | consumed samples: 12148736 | elapsed time per iteration (ms): 5629.8 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790612E+00 | loss scale: 1.0 | grad norm: 0.218 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:05:15.269953 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:05] iteration 11865/ 11920 | consumed samples: 12149760 | elapsed time per iteration (ms): 5620.8 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798076E+00 | loss scale: 1.0 | grad norm: 0.215 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:09.143962 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:11] iteration 11866/ 11920 | consumed samples: 12150784 | elapsed time per iteration (ms): 5939.8 | throughput per GPU (TFLOP/s/GPU): 76.0 | MFU 7.69% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.812430E+00 | loss scale: 1.0 | grad norm: 0.220 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:05:20.747695 | finish at 2025-09-10 11:55:32 + [2025-09-10 11:50:17] iteration 11867/ 11920 | consumed samples: 12151808 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793523E+00 | loss scale: 1.0 | grad norm: 0.239 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:58.041083 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:23] iteration 11868/ 11920 | consumed samples: 12152832 | elapsed time per iteration (ms): 5623.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.805719E+00 | loss scale: 1.0 | grad norm: 0.216 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:52.418807 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:28] iteration 11869/ 11920 | consumed samples: 12153856 | elapsed time per iteration (ms): 5636.5 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795487E+00 | loss scale: 1.0 | grad norm: 0.202 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:47.463087 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:50:34] iteration 11870/ 11920 | consumed samples: 12154880 | elapsed time per iteration (ms): 5622.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785896E+00 | loss scale: 1.0 | grad norm: 0.197 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:41.144512 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:39] iteration 11871/ 11920 | consumed samples: 12155904 | elapsed time per iteration (ms): 5619.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794741E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:35.337996 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:45] iteration 11872/ 11920 | consumed samples: 12156928 | elapsed time per iteration (ms): 5622.4 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790176E+00 | loss scale: 1.0 | grad norm: 0.184 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:29.873211 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:50:51] iteration 11873/ 11920 | consumed samples: 12157952 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787123E+00 | loss scale: 1.0 | grad norm: 0.171 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:24.221124 | finish at 2025-09-10 11:55:15 + 
[2025-09-10 11:50:57] iteration 11874/ 11920 | consumed samples: 12158976 | elapsed time per iteration (ms): 5859.4 | throughput per GPU (TFLOP/s/GPU): 77.1 | MFU 7.79% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.787292E+00 | loss scale: 1.0 | grad norm: 0.186 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:29.532753 | finish at 2025-09-10 11:55:26 + [2025-09-10 11:51:02] iteration 11875/ 11920 | consumed samples: 12160000 | elapsed time per iteration (ms): 5627.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.806377E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:13.224306 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:51:08] iteration 11876/ 11920 | consumed samples: 12161024 | elapsed time per iteration (ms): 5622.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791425E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:07.395201 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:51:13] iteration 11877/ 11920 | consumed samples: 12162048 | elapsed time per iteration (ms): 5625.2 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790153E+00 | loss scale: 1.0 | grad norm: 0.145 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:04:01.882269 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:51:19] iteration 11878/ 11920 | consumed samples: 12163072 | elapsed time per iteration (ms): 5618.4 | throughput per GPU (TFLOP/s/GPU): 80.4 | MFU 8.13% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797856E+00 | loss scale: 
1.0 | grad norm: 0.136 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:55.972293 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:51:25] iteration 11879/ 11920 | consumed samples: 12164096 | elapsed time per iteration (ms): 5627.3 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794819E+00 | loss scale: 1.0 | grad norm: 0.140 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:50.721295 | finish at 2025-09-10 11:55:15 + [2025-09-10 11:51:31] iteration 11880/ 11920 | consumed samples: 12165120 | elapsed time per iteration (ms): 5984.3 | throughput per GPU (TFLOP/s/GPU): 75.4 | MFU 7.63% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803793E+00 | loss scale: 1.0 | grad norm: 0.146 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:59.373207 | finish at 2025-09-10 11:55:30 + [2025-09-10 11:51:36] iteration 11881/ 11920 | consumed samples: 12166144 | elapsed time per iteration (ms): 5628.5 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789859E+00 | loss scale: 1.0 | grad norm: 0.152 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:39.512532 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:51:42] iteration 11882/ 11920 | consumed samples: 12167168 | elapsed time per iteration (ms): 5623.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797459E+00 | loss scale: 1.0 | grad norm: 0.178 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:33.702227 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:51:48] iteration 11883/ 11920 | consumed samples: 12168192 | elapsed time per 
iteration (ms): 5619.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.800654E+00 | loss scale: 1.0 | grad norm: 0.195 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:27.928792 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:51:53] iteration 11884/ 11920 | consumed samples: 12169216 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791544E+00 | loss scale: 1.0 | grad norm: 0.190 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:22.700003 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:51:59] iteration 11885/ 11920 | consumed samples: 12170240 | elapsed time per iteration (ms): 5633.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798080E+00 | loss scale: 1.0 | grad norm: 0.173 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:17.166529 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:04] iteration 11886/ 11920 | consumed samples: 12171264 | elapsed time per iteration (ms): 5621.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.786620E+00 | loss scale: 1.0 | grad norm: 0.165 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:11.138483 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:10] iteration 11887/ 11920 | consumed samples: 12172288 | elapsed time per iteration (ms): 5954.7 | throughput per GPU (TFLOP/s/GPU): 75.8 | MFU 7.67% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788647E+00 | loss scale: 1.0 | grad norm: 0.156 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | remaining time: 0:03:16.506673 | finish at 2025-09-10 11:55:27 + [2025-09-10 11:52:16] iteration 11888/ 11920 | consumed samples: 12173312 | elapsed time per iteration (ms): 5630.2 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.782743E+00 | loss scale: 1.0 | grad norm: 0.162 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:03:00.167572 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:22] iteration 11889/ 11920 | consumed samples: 12174336 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.803590E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:54.395480 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:27] iteration 11890/ 11920 | consumed samples: 12175360 | elapsed time per iteration (ms): 5634.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785003E+00 | loss scale: 1.0 | grad norm: 0.203 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:49.022155 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:33] iteration 11891/ 11920 | consumed samples: 12176384 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799009E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:43.016538 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:39] iteration 11892/ 11920 | consumed samples: 12177408 | elapsed time per iteration (ms): 5626.4 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning 
rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.790243E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:37.539647 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:44] iteration 11893/ 11920 | consumed samples: 12178432 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789929E+00 | loss scale: 1.0 | grad norm: 0.182 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:31.791605 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:50] iteration 11894/ 11920 | consumed samples: 12179456 | elapsed time per iteration (ms): 5621.9 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799356E+00 | loss scale: 1.0 | grad norm: 0.191 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:26.169935 | finish at 2025-09-10 11:55:16 + [2025-09-10 11:52:56] iteration 11895/ 11920 | consumed samples: 12180480 | elapsed time per iteration (ms): 5885.5 | throughput per GPU (TFLOP/s/GPU): 76.7 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789777E+00 | loss scale: 1.0 | grad norm: 0.183 | num zeros: 4.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:27.137493 | finish at 2025-09-10 11:55:23 + [2025-09-10 11:53:02] iteration 11896/ 11920 | consumed samples: 12181504 | elapsed time per iteration (ms): 5881.6 | throughput per GPU (TFLOP/s/GPU): 76.8 | MFU 7.76% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784997E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:21.159227 | finish at 2025-09-10 11:55:23 + 
[2025-09-10 11:53:08] iteration 11897/ 11920 | consumed samples: 12182528 | elapsed time per iteration (ms): 6016.0 | throughput per GPU (TFLOP/s/GPU): 75.0 | MFU 7.59% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.793523E+00 | loss scale: 1.0 | grad norm: 0.208 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:18.368555 | finish at 2025-09-10 11:55:26
+ [2025-09-10 11:53:13] iteration 11898/ 11920 | consumed samples: 12183552 | elapsed time per iteration (ms): 5643.7 | throughput per GPU (TFLOP/s/GPU): 80.0 | MFU 8.09% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791353E+00 | loss scale: 1.0 | grad norm: 0.196 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:02:04.161130 | finish at 2025-09-10 11:55:17
+ [2025-09-10 11:53:19] iteration 11899/ 11920 | consumed samples: 12184576 | elapsed time per iteration (ms): 5632.9 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791230E+00 | loss scale: 1.0 | grad norm: 0.176 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:58.291906 | finish at 2025-09-10 11:55:17
+ [2025-09-10 11:53:24] iteration 11900/ 11920 | consumed samples: 12185600 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791343E+00 | loss scale: 1.0 | grad norm: 0.164 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:52.513313 | finish at 2025-09-10 11:55:17
+ [2025-09-10 11:53:30] iteration 11901/ 11920 | consumed samples: 12186624 | elapsed time per iteration (ms): 5634.7 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.783219E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:47.059075 | finish at 2025-09-10 11:55:17
+ [2025-09-10 11:53:36] iteration 11902/ 11920 | consumed samples: 12187648 | elapsed time per iteration (ms): 5632.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.771023E+00 | loss scale: 1.0 | grad norm: 0.154 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:41.376308 | finish at 2025-09-10 11:55:17
+ [2025-09-10 11:53:42] iteration 11903/ 11920 | consumed samples: 12188672 | elapsed time per iteration (ms): 6070.6 | throughput per GPU (TFLOP/s/GPU): 74.4 | MFU 7.52% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.784194E+00 | loss scale: 1.0 | grad norm: 0.159 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:43.200419 | finish at 2025-09-10 11:55:25
+ [2025-09-10 11:53:47] iteration 11904/ 11920 | consumed samples: 12189696 | elapsed time per iteration (ms): 5628.0 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.785886E+00 | loss scale: 1.0 | grad norm: 0.158 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:30.048309 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:53:53] iteration 11905/ 11920 | consumed samples: 12190720 | elapsed time per iteration (ms): 5836.1 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.771925E+00 | loss scale: 1.0 | grad norm: 0.155 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:27.541373 | finish at 2025-09-10 11:55:21
+ [2025-09-10 11:53:59] iteration 11906/ 11920 | consumed samples: 12191744 | elapsed time per iteration (ms): 5630.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.794962E+00 | loss scale: 1.0 | grad norm: 0.150 | num zeros: 6.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:18.828670 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:05] iteration 11907/ 11920 | consumed samples: 12192768 | elapsed time per iteration (ms): 5624.6 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789003E+00 | loss scale: 1.0 | grad norm: 0.167 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:13.120081 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:10] iteration 11908/ 11920 | consumed samples: 12193792 | elapsed time per iteration (ms): 5918.2 | throughput per GPU (TFLOP/s/GPU): 76.3 | MFU 7.71% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791209E+00 | loss scale: 1.0 | grad norm: 0.181 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:11.018294 | finish at 2025-09-10 11:55:21
+ [2025-09-10 11:54:16] iteration 11909/ 11920 | consumed samples: 12194816 | elapsed time per iteration (ms): 5627.6 | throughput per GPU (TFLOP/s/GPU): 80.2 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.798051E+00 | loss scale: 1.0 | grad norm: 0.180 | num zeros: 2.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:01:01.903248 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:22] iteration 11910/ 11920 | consumed samples: 12195840 | elapsed time per iteration (ms): 5634.3 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788282E+00 | loss scale: 1.0 | grad norm: 0.188 | num zeros: 8.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:56.343133 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:28] iteration 11911/ 11920 | consumed samples: 12196864 | elapsed time per iteration (ms): 5932.0 | throughput per GPU (TFLOP/s/GPU): 76.1 | MFU 7.70% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791003E+00 | loss scale: 1.0 | grad norm: 0.206 | num zeros: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:53.387611 | finish at 2025-09-10 11:55:21
+ [2025-09-10 11:54:33] iteration 11912/ 11920 | consumed samples: 12197888 | elapsed time per iteration (ms): 5625.7 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.11% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.792565E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:45.005308 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:39] iteration 11913/ 11920 | consumed samples: 12198912 | elapsed time per iteration (ms): 5625.1 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.797979E+00 | loss scale: 1.0 | grad norm: 0.194 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:39.375761 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:45] iteration 11914/ 11920 | consumed samples: 12199936 | elapsed time per iteration (ms): 5621.3 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.791074E+00 | loss scale: 1.0 | grad norm: 0.199 | num zeros: 5.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:33.727987 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:50] iteration 11915/ 11920 | consumed samples: 12200960 | elapsed time per iteration (ms): 5623.5 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795837E+00 | loss scale: 1.0 | grad norm: 0.228 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:28.117480 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:54:56] iteration 11916/ 11920 | consumed samples: 12201984 | elapsed time per iteration (ms): 5623.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.795541E+00 | loss scale: 1.0 | grad norm: 0.230 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:22.491986 | finish at 2025-09-10 11:55:18
+ [2025-09-10 11:55:02] iteration 11917/ 11920 | consumed samples: 12203008 | elapsed time per iteration (ms): 5841.1 | throughput per GPU (TFLOP/s/GPU): 77.3 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.789847E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 7.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:17.523204 | finish at 2025-09-10 11:55:19
+ [2025-09-10 11:55:07] iteration 11918/ 11920 | consumed samples: 12204032 | elapsed time per iteration (ms): 5637.1 | throughput per GPU (TFLOP/s/GPU): 80.1 | MFU 8.10% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.796729E+00 | loss scale: 1.0 | grad norm: 0.211 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:11.274196 | finish at 2025-09-10 11:55:19
+ [2025-09-10 11:55:13] iteration 11919/ 11920 | consumed samples: 12205056 | elapsed time per iteration (ms): 5835.3 | throughput per GPU (TFLOP/s/GPU): 77.4 | MFU 7.82% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.799804E+00 | loss scale: 1.0 | grad norm: 0.222 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:05.835260 | finish at 2025-09-10 11:55:19
+ [2025-09-10 11:55:19] iteration 11920/ 11920 | consumed samples: 12206080 | elapsed time per iteration (ms): 5625.0 | throughput per GPU (TFLOP/s/GPU): 80.3 | MFU 8.12% | learning rate: 2.000000E-03 | global batch size: 1024 | lm loss: 2.788778E+00 | loss scale: 1.0 | grad norm: 0.214 | num zeros: 3.0 | number of skipped iterations: 0 | number of nan iterations: 0 | remaining time: 0:00:00 | finish at 2025-09-10 11:55:19
+(min, max) time across ranks (ms):
+ save-checkpoint ................................: (3962.90, 3963.25)
+[rank24]:[W910 11:55:37.951416056 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())