```text
Loaded loader_megatron_core as the loader.
Loaded saver_llama2_hf_bf as the saver.
Starting saver...
Starting loader...
fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:31: UserWarning: Failed to import apex plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:31: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:31: UserWarning: Failed to import megatron plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
```
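This is the startup of Megatron-LM's loader/saver checkpoint converter: a loader plugin reads the Megatron-Core checkpoint while a saver plugin writes Llama-2-style Hugging Face weights in bf16 (the modelopt plugin warnings and the `fused_indices_to_multihot` deprecation notice are benign here). A minimal sketch of the kind of invocation that produces this log, assuming the flags of upstream `tools/checkpoint/convert.py`; the directories are placeholders, and the loader/saver names are taken from the log itself (they look like plugins from a custom fork):

```python
# Sketch only: launch the Megatron-LM checkpoint converter.
# Flags follow upstream tools/checkpoint/convert.py; this fork's
# entry point may differ. Paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "tools/checkpoint/convert.py",
        "--model-type", "GPT",
        "--loader", "loader_megatron_core",   # loader named in the log
        "--saver", "saver_llama2_hf_bf",      # saver named in the log
        "--load-dir", "/path/to/megatron_ckpt",
        "--save-dir", "/path/to/hf_out",
    ],
    check=True,
)
```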
```text
Setting num_layers to 28 from checkpoint
Setting hidden_size to 5120 from checkpoint
Setting ffn_hidden_size to 27648 from checkpoint
Setting seq_length to 131072 from checkpoint
Setting num_attention_heads to 40 from checkpoint
Setting num_query_groups to 8 from checkpoint
Setting group_query_attention to True from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 131072 from checkpoint
Setting position_embedding_type to rope from checkpoint
Setting add_position_embedding to True from checkpoint
Setting use_rotary_position_embeddings to True from checkpoint
Setting rotary_base to 500000 from checkpoint
Setting rotary_percent to 1.0 from checkpoint
Setting rotary_interleaved to False from checkpoint
Setting add_bias_linear to False from checkpoint
Setting add_qkv_bias to False from checkpoint
Setting squared_relu to False from checkpoint
Setting swiglu to True from checkpoint
Setting untie_embeddings_and_output_weights to True from checkpoint
Setting apply_layernorm_1p to False from checkpoint
Setting normalization to RMSNorm from checkpoint
Setting apply_query_key_layer_scaling to False from checkpoint
Setting attention_dropout to 0.0 from checkpoint
Setting hidden_dropout to 0.0 from checkpoint
Checkpoint did not provide arguments hybrid_override_pattern
Checkpoint did not provide arguments spec
Setting hybrid_attention_ratio to 0.0 from checkpoint
Setting hybrid_mlp_ratio to 0.0 from checkpoint
Checkpoint did not provide arguments num_experts
Setting moe_layer_freq to 1 from checkpoint
Setting moe_router_topk to 2 from checkpoint
Setting moe_router_pre_softmax to False from checkpoint
Setting moe_grouped_gemm to False from checkpoint
Checkpoint did not provide arguments moe_shared_expert_intermediate_size
Setting mamba_state_dim to 128 from checkpoint
Setting mamba_head_dim to 64 from checkpoint
Setting mamba_num_groups to 8 from checkpoint
Checkpoint did not provide arguments mamba_num_heads
Setting is_hybrid_model to False from checkpoint
Checkpoint did not provide arguments heterogeneous_layers_config_path
Checkpoint did not provide arguments heterogeneous_layers_config_encoded_json
Setting tokenizer_type to SFTTokenizer from checkpoint
Setting tokenizer_model to /cpfs01/users/wzhang/iquest-coder-v1.1/RepoData-Ucoder-32B-128k-from2.5.2/97.09B_instruct_iquest-coder from checkpoint
Checkpoint did not provide arguments tiktoken_pattern
Setting padded_vocab_size to 76800 from checkpoint
```
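The `Setting ... from checkpoint` block is the loader recovering the architecture from the checkpoint's saved arguments: a 28-layer, 5120-hidden GPT with grouped-query attention (40 heads, 8 KV groups, 128-dim heads), a 27648-wide SwiGLU MLP, RMSNorm, RoPE with base 500000 over a 131072-token context, untied embeddings, and a padded vocabulary of 76800. As a sketch, those arguments map onto a Hugging Face Llama-style config roughly as follows (`rms_norm_eps` is an assumption, since the log does not print it):

```python
# Sketch: the checkpoint arguments above expressed as a Hugging Face
# Llama-style config. All values except rms_norm_eps come from the log.
from transformers import LlamaConfig

config = LlamaConfig(
    num_hidden_layers=28,            # num_layers
    hidden_size=5120,                # hidden_size
    intermediate_size=27648,         # ffn_hidden_size
    num_attention_heads=40,          # num_attention_heads
    num_key_value_heads=8,           # num_query_groups (GQA)
    max_position_embeddings=131072,  # max_position_embeddings
    rope_theta=500000,               # rotary_base
    hidden_act="silu",               # swiglu=True -> SiLU-gated MLP
    attention_dropout=0.0,           # attention_dropout
    tie_word_embeddings=False,       # untie_embeddings_and_output_weights
    vocab_size=76800,                # padded_vocab_size
    rms_norm_eps=1e-5,               # assumption: not shown in the log
)
```

Note that `kv_channels=128` equals `hidden_size / num_attention_heads` (5120 / 40), so no explicit head-dimension override is needed.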
```text
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 1
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https:
building GPT model ...
(TP, PP) mismatch after resume ((1, 1) vs (8, 1) from checkpoint): RNG state will be ignored
sharded_state_dict metadata loaded from the checkpoint: {'distrib_optim_sharding_type': 'dp_reshardable', 'singleton_local_shards': False, 'chained_optim_avoid_prefix': True}
Job sharding has changed: Rerun state will be ignored
loading distributed checkpoint from /tmp/megatron_convert_iter1970_node0_pid360_a250e6f4 at iteration 1970
/volume/pt-train/users/wzhang/wjj-workspace/code-sft/src/training/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py:956: FutureWarning: `load_state_dict` is deprecated and will be removed in future versions. Please use `load` instead.
  checkpoint.load_state_dict(
/usr/local/lib/python3.12/dist-packages/torch/distributed/checkpoint/planner_helpers.py:406: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  device = getattr(value, "device", None)
/usr/local/lib/python3.12/dist-packages/torch/distributed/checkpoint/default_planner.py:454: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  and md.size != obj.size()
checkpoint version 3.0
successfully loaded checkpoint from /tmp/megatron_convert_iter1970_node0_pid360_a250e6f4 [ t 1/1, p 1/1 ] at iteration 1970
```
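The `(TP, PP) mismatch` and `Job sharding has changed` warnings are expected and harmless in a conversion run: the checkpoint was written with tensor parallelism 8, while the converter loads it on a single rank (TP=1, PP=1), so Megatron's distributed checkpointing reassembles each sharded tensor to its full shape and discards only the RNG/rerun state tied to the old sharding. A toy illustration of that reassembly for one column-parallel weight (not the converter's actual code):

```python
# Toy sketch of TP resharding: a column-parallel weight saved under
# TP=8 exists as eight row-shards; loading at TP=1 concatenates them.
import torch

tp, ffn_hidden, hidden = 8, 27648, 5120
shards = [torch.zeros(ffn_hidden // tp, hidden) for _ in range(tp)]
full_weight = torch.cat(shards, dim=0)
assert full_weight.shape == (ffn_hidden, hidden)
```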
```text
sending embeddings
sending transformer layer 0
sending transformer layer 1
sending transformer layer 2
sending transformer layer 3
sending transformer layer 4
sending transformer layer 5
sending transformer layer 6
sending transformer layer 7
sending transformer layer 8
sending transformer layer 9
sending transformer layer 10
sending transformer layer 11
sending transformer layer 12
sending transformer layer 13
sending transformer layer 14
sending transformer layer 15
sending transformer layer 16
sending transformer layer 17
sending transformer layer 18
sending transformer layer 19
sending transformer layer 20
sending transformer layer 21
sending transformer layer 22
sending transformer layer 23
sending transformer layer 24
sending transformer layer 25
sending transformer layer 26
sending transformer layer 27
sending final norm
sending output layer
Waiting for saver to complete...
```
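The alternating `sending ...` / `received ...` lines reflect the converter's producer/consumer design: the loader pushes one message per model chunk (embeddings, each of the 28 transformer layers, the final norm, the output layer) onto a queue, and the saver consumes them in order while converting and writing the weights. A stripped-down sketch of that hand-off; the real converter passes dicts of tensors between processes, not strings:

```python
# Stripped-down sketch of the loader -> saver queue hand-off.
from multiprocessing import Process, Queue

def loader(queue):
    # Produce chunks in the order seen in the log above.
    queue.put("embeddings")
    for layer in range(28):
        queue.put(f"transformer layer {layer}")
    queue.put("final norm")
    queue.put("output layer")
    queue.put(None)  # sentinel: loader is done

def saver(queue):
    # Consume until the sentinel arrives.
    while (msg := queue.get()) is not None:
        print(f"received {msg}")

if __name__ == "__main__":
    q = Queue()
    producer = Process(target=loader, args=(q,))
    producer.start()
    saver(q)
    producer.join()
```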
```text
fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
received embeddings
received transformer layer 0
received transformer layer 1
received transformer layer 2
received transformer layer 3
received transformer layer 4
received transformer layer 5
received transformer layer 6
received transformer layer 7
received transformer layer 8
received transformer layer 9
received transformer layer 10
received transformer layer 11
received transformer layer 12
received transformer layer 13
received transformer layer 14
received transformer layer 15
received transformer layer 16
received transformer layer 17
received transformer layer 18
received transformer layer 19
received transformer layer 20
received transformer layer 21
received transformer layer 22
received transformer layer 23
received transformer layer 24
received transformer layer 25
received transformer layer 26
received transformer layer 27
received final norm
received output layer
Saving model to disk ...
```
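Once `Saving model to disk ...` finishes, the save directory should contain a standard Hugging Face checkpoint in bf16 (as the `saver_llama2_hf_bf` name suggests). A quick sanity check, assuming the saver wrote a config that transformers can read; the path is a placeholder:

```python
# Sanity-check sketch: reload the converted checkpoint with transformers.
# "/path/to/hf_out" stands in for the converter's --save-dir.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/hf_out", torch_dtype=torch.bfloat16
)
print(model.config)
```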