diff --git "a/logs/20250526_191827/train.log" "b/logs/20250526_191827/train.log"
new file mode 100644
--- /dev/null
+++ "b/logs/20250526_191827/train.log"
@@ -0,0 +1,2057 @@
+2025-05-26 19:18:47,642 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_321e0871e56ca1df.zip.
+2025-05-26 19:18:47,643 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
+2025-05-26 19:18:46,714 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:2983
+2025-05-26 19:18:53,403 SUCC cli.py:63 -- -------------------------------------------------------
+2025-05-26 19:18:53,403 SUCC cli.py:64 -- Job 'raysubmit_YRVyrdpJQsux5E4C' submitted successfully
+2025-05-26 19:18:53,403 SUCC cli.py:65 -- -------------------------------------------------------
+2025-05-26 19:18:53,403 INFO cli.py:289 -- Next steps
+2025-05-26 19:18:53,403 INFO cli.py:290 -- Query the logs of the job:
+2025-05-26 19:18:53,403 INFO cli.py:292 -- ray job logs raysubmit_YRVyrdpJQsux5E4C
+2025-05-26 19:18:53,403 INFO cli.py:294 -- Query the status of the job:
+2025-05-26 19:18:53,403 INFO cli.py:296 -- ray job status raysubmit_YRVyrdpJQsux5E4C
+2025-05-26 19:18:53,403 INFO cli.py:298 -- Request the job to be stopped:
+2025-05-26 19:18:53,404 INFO cli.py:300 -- ray job stop raysubmit_YRVyrdpJQsux5E4C
+2025-05-26 19:18:53,406 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
+2025-05-26 19:18:52,847 INFO job_manager.py:531 -- Runtime env is setting up.
+[2025-05-26 19:19:13,190] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+INFO 05-26 19:19:17 [__init__.py:239] Automatically detected platform cuda.
+2025-05-26 19:19:18,557 INFO worker.py:1520 -- Using address 10.140.1.87:6231 set in the environment variable RAY_ADDRESS
+2025-05-26 19:19:18,559 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.1.87:6231...
+2025-05-26 19:19:18,580 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.140.1.87:2983
+(pid=89991) INFO 05-26 19:19:38 [__init__.py:239] Automatically detected platform cuda.
+(LLMRayActor pid=89992) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
+(pid=89985) INFO 05-26 19:19:38 [__init__.py:239] Automatically detected platform cuda. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
+(LLMRayActor pid=89991) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
+(LLMRayActor pid=89991) WARNING 05-26 19:20:05 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
+(LLMRayActor pid=89991) WARNING 05-26 19:20:05 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
+(LLMRayActor pid=89991) INFO 05-26 19:20:05 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
+(LLMRayActor pid=89988) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
+(LLMRayActor pid=89993) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
+(LLMRayActor pid=89986) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'classify', 'generate', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
+(LLMRayActor pid=89989) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
+(LLMRayActor pid=89990) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
+(LLMRayActor pid=89985) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
+(LLMRayActor pid=89991) [2025-05-26 19:20:08,722] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+(LLMRayActor pid=89991) INFO 05-26 19:20:13 [cuda.py:293] Using Flash Attention backend.
+(LLMRayActor pid=89985) WARNING 05-26 19:20:05 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine. [repeated 7x across cluster]
+(LLMRayActor pid=89985) WARNING 05-26 19:20:05 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled. [repeated 7x across cluster]
+(LLMRayActor pid=89985) INFO 05-26 19:20:05 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=49, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  [repeated 7x across cluster]
+(LLMRayActor pid=89988) INFO 05-26 19:20:16 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
+(LLMRayActor pid=89988) INFO 05-26 19:20:16 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/...
+(LLMRayActor pid=89985) [2025-05-26 19:20:08,723] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 7x across cluster]
+(LLMRayActor pid=89988) INFO 05-26 19:20:16 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
+(LLMRayActor pid=89990)
+Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,377] [INFO] [logging.py:128:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,377] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,607] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,607] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 3.98 GB CA 4.04 GB Max_CA 4 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,608] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.96 GB, percent = 46.7%
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,611] [INFO] [stage3.py:170:__init__] Reduce bucket size 500000000
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,611] [INFO] [stage3.py:171:__init__] Prefetch bucket size 50000000
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,839] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,840] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,841] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.95 GB, percent = 46.7%
+(ActorModelRayActor pid=100959) Parameter Offload: Total persistent parameters: 848896 in 368 params
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,090] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,090] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,091] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.96 GB, percent = 46.7%
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,310] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,311] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,312] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.96 GB, percent = 46.7%
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,718] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,719] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.94 GB CA 1.94 GB Max_CA 4 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,719] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 473.35 GB, percent = 47.0%
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,939] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,940] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 1.94 GB Max_CA 2 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,941] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 476.15 GB, percent = 47.3%
+(ReferenceModelRayActor pid=101495)
+Loading checkpoint shards:  60%|██████    | 3/5 [00:20<00:12, 6.39s/it] [repeated 16x across cluster]
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,363] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,364] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 1.94 GB Max_CA 2 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,364] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 500.86 GB, percent = 49.7%
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,595] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,596] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 1.94 GB Max_CA 2 GB
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,597] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 505.19 GB, percent = 50.2%
+(ActorModelRayActor pid=100959) in preprocess_data None False [repeated 26000x across cluster]
+(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,318] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8 [repeated 7x across cluster]
+(ReferenceModelRayActor pid=103677)
+Loading checkpoint shards:  80%|████████  | 4/5 [00:25<00:06, 6.04s/it]
+(ReferenceModelRayActor pid=103679)
+Loading checkpoint shards:  80%|████████  | 4/5 [00:25<00:06, 6.04s/it]
+(ReferenceModelRayActor pid=103677)
+Loading checkpoint shards: 100%|██████████| 5/5 [00:25<00:00, 3.96s/it]
+Loading checkpoint shards: 100%|██████████| 5/5 [00:25<00:00, 5.16s/it]
+(ReferenceModelRayActor pid=101495) Actor(
+(ReferenceModelRayActor pid=101495)   (model): Qwen2_5_VLForConditionalGeneration(
+(ReferenceModelRayActor pid=101495)     (visual): Qwen2_5_VisionTransformerPretrainedModel(
+(ReferenceModelRayActor pid=101495)       (patch_embed): Qwen2_5_VisionPatchEmbed(
+(ReferenceModelRayActor pid=101495)         (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
+(ReferenceModelRayActor pid=101495)       )
+(ReferenceModelRayActor pid=101495)       (rotary_pos_emb): Qwen2_5_VisionRotaryEmbedding()
+(ReferenceModelRayActor pid=101495)       (blocks): ModuleList(
+(ReferenceModelRayActor pid=101495)         (0-31): 32 x Qwen2_5_VLVisionBlock(
+(ReferenceModelRayActor pid=101495)           (norm1): Qwen2RMSNorm((0,), eps=1e-06)
+(ReferenceModelRayActor pid=101495)           (norm2): Qwen2RMSNorm((0,), eps=1e-06)
+(ReferenceModelRayActor pid=101495)           (attn): Qwen2_5_VLVisionFlashAttention2(
+(ReferenceModelRayActor pid=101495)             (qkv): Linear(in_features=1280, out_features=3840, bias=True)
+(ReferenceModelRayActor pid=101495)             (proj): Linear(in_features=1280, out_features=1280, bias=True)
+(ReferenceModelRayActor pid=101495)           )
+(ReferenceModelRayActor pid=101495)           (mlp): Qwen2_5_VLMLP(
+(ReferenceModelRayActor pid=101495)             (gate_proj): Linear(in_features=1280, out_features=3420, bias=True)
+(ReferenceModelRayActor pid=101495)             (up_proj): Linear(in_features=1280, out_features=3420, bias=True)
+(ReferenceModelRayActor pid=101495)             (down_proj): Linear(in_features=3420, out_features=1280, bias=True)
+(ReferenceModelRayActor pid=101495)             (act_fn): SiLU()
+(ReferenceModelRayActor pid=101495)           )
+(ReferenceModelRayActor pid=101495)         )
+(ReferenceModelRayActor pid=101495)       )
+(ReferenceModelRayActor pid=101495)       (merger): Qwen2_5_VLPatchMerger(
+(ReferenceModelRayActor pid=101495)         (ln_q): Qwen2RMSNorm((0,), eps=1e-06)
+(ReferenceModelRayActor pid=101495)         (mlp): Sequential(
+(ReferenceModelRayActor pid=101495)           (0): Linear(in_features=5120, out_features=5120, bias=True)
+(ReferenceModelRayActor pid=101495)           (1): GELU(approximate='none')
+(ReferenceModelRayActor pid=101495)           (2): Linear(in_features=5120, out_features=3584, bias=True)
+(ReferenceModelRayActor pid=101495)         )
+(ReferenceModelRayActor pid=101495)       )
+(ReferenceModelRayActor pid=101495)     )
+(ReferenceModelRayActor pid=101495)     (model): Qwen2_5_VLModel(
+(ReferenceModelRayActor pid=101495)       (embed_tokens): Embedding(152064, 3584)
+(ReferenceModelRayActor pid=101495)       (layers): ModuleList(
+(ReferenceModelRayActor pid=101495)         (0-27): 28 x Qwen2_5_VLDecoderLayer(
+(ReferenceModelRayActor pid=101495)           (self_attn): Qwen2_5_VLFlashAttention2(
+(ReferenceModelRayActor pid=101495)             (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+(ReferenceModelRayActor pid=101495)             (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+(ReferenceModelRayActor pid=101495)             (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+(ReferenceModelRayActor pid=101495)             (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+(ReferenceModelRayActor pid=101495)             (rotary_emb): Qwen2_5_VLRotaryEmbedding()
+(ReferenceModelRayActor pid=101495)           )
+(ReferenceModelRayActor pid=101495)           (mlp): Qwen2MLP(
+(ReferenceModelRayActor pid=101495)             (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+(ReferenceModelRayActor pid=101495)             (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+(ReferenceModelRayActor pid=101495)             (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+(ReferenceModelRayActor pid=101495)             (act_fn): SiLU()
+(ReferenceModelRayActor pid=101495)           )
+(ReferenceModelRayActor pid=101495)           (input_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
+(ReferenceModelRayActor pid=101495)           (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
+(ReferenceModelRayActor pid=101495)         )
+(ReferenceModelRayActor pid=101495)       )
+(ReferenceModelRayActor pid=101495)       (norm): Qwen2RMSNorm((0,), eps=1e-06)
+(ReferenceModelRayActor pid=101495)       (rotary_emb): Qwen2_5_VLRotaryEmbedding()
+(ReferenceModelRayActor pid=101495)     )
+(ReferenceModelRayActor pid=101495)     (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+(ReferenceModelRayActor pid=101495)   )
+(ReferenceModelRayActor pid=101495) )
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,658] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,658] [INFO] [comm.py:683:init_distributed] Distributed backend already initialized
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,677] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,679] [INFO] [logging.py:128:log_dist] [Rank 0] Creating ZeRO Offload
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,901] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,902] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 3.98 GB CA 4.04 GB Max_CA 4 GB
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,903] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 582.59 GB, percent = 57.8%
+(ReferenceModelRayActor pid=101495) Parameter Offload: Total persistent parameters: 848896 in 368 params
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,125] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,125] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,126] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 586.3 GB, percent = 58.2%
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,128] [INFO] [config.py:1001:print] DeepSpeedEngine configuration:
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   activation_checkpointing_config  {
+(ReferenceModelRayActor pid=101495)     "partition_activations": false,
+(ReferenceModelRayActor pid=101495)     "contiguous_memory_optimization": false,
+(ReferenceModelRayActor pid=101495)     "cpu_checkpointing": false,
+(ReferenceModelRayActor pid=101495)     "number_checkpoints": null,
+(ReferenceModelRayActor pid=101495)     "synchronize_checkpoint_boundary": false,
+(ReferenceModelRayActor pid=101495)     "profile": false
+(ReferenceModelRayActor pid=101495) }
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   amp_enabled .................. False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   amp_params ................... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   autotuning_config ............ {
+(ReferenceModelRayActor pid=101495)     "enabled": false,
+(ReferenceModelRayActor pid=101495)     "start_step": null,
+(ReferenceModelRayActor pid=101495)     "end_step": null,
+(ReferenceModelRayActor pid=101495)     "metric_path": null,
+(ReferenceModelRayActor pid=101495)     "arg_mappings": null,
+(ReferenceModelRayActor pid=101495)     "metric": "throughput",
+(ReferenceModelRayActor pid=101495)     "model_info": null,
+(ReferenceModelRayActor pid=101495)     "results_dir": "autotuning_results",
+(ReferenceModelRayActor pid=101495)     "exps_dir": "autotuning_exps",
+(ReferenceModelRayActor pid=101495)     "overwrite": true,
+(ReferenceModelRayActor pid=101495)     "fast": true,
+(ReferenceModelRayActor pid=101495)     "start_profile_step": 3,
+(ReferenceModelRayActor pid=101495)     "end_profile_step": 5,
+(ReferenceModelRayActor pid=101495)     "tuner_type": "gridsearch",
+(ReferenceModelRayActor pid=101495)     "tuner_early_stopping": 5,
+(ReferenceModelRayActor pid=101495)     "tuner_num_trials": 50,
+(ReferenceModelRayActor pid=101495)     "model_info_path": null,
+(ReferenceModelRayActor pid=101495)     "mp_size": 1,
+(ReferenceModelRayActor pid=101495)     "max_train_batch_size": null,
+(ReferenceModelRayActor pid=101495)     "min_train_batch_size": 1,
+(ReferenceModelRayActor pid=101495)     "max_train_micro_batch_size_per_gpu": 1.024000e+03,
+(ReferenceModelRayActor pid=101495)     "min_train_micro_batch_size_per_gpu": 1,
+(ReferenceModelRayActor pid=101495)     "num_tuning_micro_batch_sizes": 3
+(ReferenceModelRayActor pid=101495) }
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   bfloat16_enabled ............. True
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   bfloat16_immediate_grad_update False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   checkpoint_parallel_write_pipeline False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   checkpoint_tag_validation_enabled True
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   checkpoint_tag_validation_fail False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   comms_config .................
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   communication_data_type ...... None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   curriculum_enabled_legacy .... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   curriculum_params_legacy ..... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print]   data_efficiency_enabled ...... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   dataloader_drop_last ......... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   disable_allgather ............ False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   dump_state ................... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   dynamic_loss_scale_args ...... None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_enabled ........... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_gas_boundary_resolution 1
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_layer_name ........ bert.encoder.layer
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_layer_num ......... 0
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_max_iter .......... 100
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_stability ......... 1e-06
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_tol ............... 0.01
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   eigenvalue_verbose ........... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   elasticity_enabled ........... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   flops_profiler_config ........ {
+(ReferenceModelRayActor pid=101495)     "enabled": false,
+(ReferenceModelRayActor pid=101495)     "recompute_fwd_factor": 0.0,
+(ReferenceModelRayActor pid=101495)     "profile_step": 1,
+(ReferenceModelRayActor pid=101495)     "module_depth": -1,
+(ReferenceModelRayActor pid=101495)     "top_modules": 1,
+(ReferenceModelRayActor pid=101495)     "detailed": true,
+(ReferenceModelRayActor pid=101495)     "output_file": null
+(ReferenceModelRayActor pid=101495) }
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   fp16_auto_cast ............... None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   fp16_enabled ................. False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   fp16_master_weights_and_gradients False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   global_rank .................. 0
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   grad_accum_dtype ............. None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   gradient_accumulation_steps .. 8
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   gradient_clipping ............ 1.0
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print]   gradient_predivide_factor .... 1.0
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   graph_harvesting ............. False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   initial_dynamic_scale ........ 1
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   load_universal_checkpoint .... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   loss_scale ................... 1.0
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   memory_breakdown ............. False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   mics_hierarchial_params_gather False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   mics_shard_size .............. -1
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   nebula_config ................ {
+(ReferenceModelRayActor pid=101495)     "enabled": false,
+(ReferenceModelRayActor pid=101495)     "persistent_storage_path": null,
+(ReferenceModelRayActor pid=101495)     "persistent_time_interval": 100,
+(ReferenceModelRayActor pid=101495)     "num_of_version_in_retention": 2,
+(ReferenceModelRayActor pid=101495)     "enable_nebula_load": true,
+(ReferenceModelRayActor pid=101495)     "load_path": null
+(ReferenceModelRayActor pid=101495) }
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   optimizer_legacy_fusion ...... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   optimizer_name ............... None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   optimizer_params ............. None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   pld_enabled .................. False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   pld_params ................... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   prescale_gradients ........... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   scheduler_name ............... None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   scheduler_params ............. None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   seq_parallel_communication_data_type torch.float32
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print]   sparse_attention ............. None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   sparse_gradients_enabled ..... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   steps_per_print .............. 100
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   timers_config ................ enabled=True synchronized=True
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   train_batch_size ............. 128
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   train_micro_batch_size_per_gpu 2
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   use_data_before_expert_parallel_ False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   use_node_local_storage ....... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   wall_clock_breakdown ......... False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   weight_quantization_config ... None
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   world_size ................... 8
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   zero_allow_untested_optimizer False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   zero_enabled ................. True
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   zero_force_ds_cpu_optimizer .. True
+(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print]   zero_optimization_stage ......
3 +(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:991:print_user_config] json = { +(ReferenceModelRayActor pid=101495) "steps_per_print": 100, +(ReferenceModelRayActor pid=101495) "zero_optimization": { +(ReferenceModelRayActor pid=101495) "stage": 3, +(ReferenceModelRayActor pid=101495) "stage3_max_live_parameters": "auto", +(ReferenceModelRayActor pid=101495) "stage3_max_reuse_distance": "auto", +(ReferenceModelRayActor pid=101495) "stage3_param_persistence_threshold": "auto", +(ReferenceModelRayActor pid=101495) "stage3_prefetch_bucket_size": "auto", +(ReferenceModelRayActor pid=101495) "offload_param": { +(ReferenceModelRayActor pid=101495) "device": "none", +(ReferenceModelRayActor pid=101495) "pin_memory": true +(ReferenceModelRayActor pid=101495) } +(ReferenceModelRayActor pid=101495) }, +(ReferenceModelRayActor pid=101495) "bf16": { +(ReferenceModelRayActor pid=101495) "enabled": true +(ReferenceModelRayActor pid=101495) }, +(ReferenceModelRayActor pid=101495) "gradient_clipping": 1.0, +(ReferenceModelRayActor pid=101495) "prescale_gradients": false, +(ReferenceModelRayActor pid=101495) "wall_clock_breakdown": false, +(ReferenceModelRayActor pid=101495) "train_micro_batch_size_per_gpu": 2, +(ReferenceModelRayActor pid=101495) "train_batch_size": 128 +(ReferenceModelRayActor pid=101495) } +(ActorModelRayActor pid=100959) [2025-05-26 19:23:30,762] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states +(ActorModelRayActor pid=100959) [2025-05-26 19:23:30,764] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized +(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,658] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8 [repeated 8x across cluster] +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,511] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] 
[logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client LR scheduler +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)] +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True 
pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False +(ActorModelRayActor pid=100959) "device": "none" +(ActorModelRayActor pid=100959) "offload_optimizer": { +(ActorModelRayActor pid=100959) "device": "cpu", +(ActorModelRayActor pid=100959) "sub_group_size": "auto", +(ActorModelRayActor pid=100959) "reduce_bucket_size": "auto", +(ActorModelRayActor pid=100959) "zero_hpz_partition_size": 1, +(ActorModelRayActor pid=100959) "zero_quantized_weights": false, +(ActorModelRayActor pid=100959) "zero_quantized_gradients": false, +(ActorModelRayActor pid=100959) "reduce_scatter": true +(ActorModelRayActor pid=100959) "data_types": { +(ActorModelRayActor pid=100959) "grad_accum_dtype": null +(ActorModelRayActor pid=100959) "checkpoint": { +(ActorModelRayActor pid=100959) "load_universal": false +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,512] [INFO] [utils.py:782:see_memory_usage] MA 2.86 GB Max_MA 4.89 GB CA 5.02 GB Max_CA 5 GB  [repeated 2x across cluster] +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 549.69 GB, percent = 54.6% [repeated 2x across cluster] +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1001:print] DeepSpeedEngine configuration: +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] activation_checkpointing_config { +(ActorModelRayActor pid=100959) "partition_activations": false, +(ActorModelRayActor pid=100959) "contiguous_memory_optimization": false, +(ActorModelRayActor pid=100959) "cpu_checkpointing": false, +(ActorModelRayActor pid=100959) "number_checkpoints": null, +(ActorModelRayActor pid=100959) "synchronize_checkpoint_boundary": false, +(ActorModelRayActor pid=100959) "profile": false +(ActorModelRayActor pid=100959) } [repeated 5x across cluster] +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] aio_config 
................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] amp_enabled .................. False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] amp_params ................... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] autotuning_config ............ { +(ActorModelRayActor pid=100959) "enabled": false,  [repeated 3x across cluster] +(ActorModelRayActor pid=100959) "start_step": null, +(ActorModelRayActor pid=100959) "end_step": null, +(ActorModelRayActor pid=100959) "metric_path": null, +(ActorModelRayActor pid=100959) "arg_mappings": null, +(ActorModelRayActor pid=100959) "metric": "throughput", +(ActorModelRayActor pid=100959) "model_info": null, +(ActorModelRayActor pid=100959) "results_dir": "autotuning_results", +(ActorModelRayActor pid=100959) "exps_dir": "autotuning_exps", +(ActorModelRayActor pid=100959) "overwrite": true, +(ActorModelRayActor pid=100959) "fast": true, +(ActorModelRayActor pid=100959) "start_profile_step": 3, +(ActorModelRayActor pid=100959) "end_profile_step": 5, +(ActorModelRayActor pid=100959) "tuner_type": "gridsearch", +(ActorModelRayActor pid=100959) "tuner_early_stopping": 5, +(ActorModelRayActor pid=100959) "tuner_num_trials": 50, +(ActorModelRayActor pid=100959) "model_info_path": null, +(ActorModelRayActor pid=100959) "mp_size": 1, +(ActorModelRayActor pid=100959) "max_train_batch_size": null, +(ActorModelRayActor pid=100959) "min_train_batch_size": 1, +(ActorModelRayActor pid=100959) "max_train_micro_batch_size_per_gpu": 1.024000e+03, +(ActorModelRayActor pid=100959) "min_train_micro_batch_size_per_gpu": 1, +(ActorModelRayActor pid=100959) "num_tuning_micro_batch_sizes": 3 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] 
[config.py:1005:print] bfloat16_enabled ............. True +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] fp16_master_weights_and_gradients False [repeated 2x across cluster] +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] checkpoint_parallel_write_pipeline False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] checkpoint_tag_validation_enabled True +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] checkpoint_tag_validation_fail False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] comms_config ................. +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] communication_data_type ...... None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 
'layer_reduction': {'enabled': False}} +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] curriculum_enabled_legacy .... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] curriculum_params_legacy ..... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] data_efficiency_enabled ...... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] dataloader_drop_last ......... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] disable_allgather ............ False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] dump_state ................... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] dynamic_loss_scale_args ...... None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_enabled ........... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_gas_boundary_resolution 1 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_layer_name ........ bert.encoder.layer +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_layer_num ......... 0 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_max_iter .......... 
100 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_stability ......... 1e-06 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] eigenvalue_tol ............... 0.01 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] eigenvalue_verbose ........... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] elasticity_enabled ........... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] flops_profiler_config ........ { +(ActorModelRayActor pid=100959) "recompute_fwd_factor": 0.0, +(ActorModelRayActor pid=100959) "profile_step": 1, +(ActorModelRayActor pid=100959) "module_depth": -1, +(ActorModelRayActor pid=100959) "top_modules": 1, +(ActorModelRayActor pid=100959) "detailed": true, +(ActorModelRayActor pid=100959) "output_file": null +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] fp16_auto_cast ............... None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] fp16_enabled ................. False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] global_rank .................. 0 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] grad_accum_dtype ............. None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] gradient_accumulation_steps .. 8 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] gradient_clipping ............ 1.0 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] gradient_predivide_factor .... 1.0 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] graph_harvesting ............. 
False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] initial_dynamic_scale ........ 1 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] load_universal_checkpoint .... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] loss_scale ................... 1.0 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] memory_breakdown ............. False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] mics_hierarchial_params_gather False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] mics_shard_size .............. -1 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] nebula_config ................ 
{ +(ActorModelRayActor pid=100959) "persistent_storage_path": null, +(ActorModelRayActor pid=100959) "persistent_time_interval": 100, +(ActorModelRayActor pid=100959) "num_of_version_in_retention": 2, +(ActorModelRayActor pid=100959) "enable_nebula_load": true, +(ActorModelRayActor pid=100959) "load_path": null +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] optimizer_legacy_fusion ...... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] optimizer_name ............... None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] optimizer_params ............. None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] pld_enabled .................. False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] pld_params ................... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] prescale_gradients ........... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] scheduler_name ............... None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] scheduler_params ............. None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] seq_parallel_communication_data_type torch.float32 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] sparse_attention ............. None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] sparse_gradients_enabled ..... 
False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] steps_per_print .............. 100 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] timers_config ................ enabled=True synchronized=True +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] train_batch_size ............. 128 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] train_micro_batch_size_per_gpu 2 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] use_data_before_expert_parallel_ False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] use_node_local_storage ....... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] wall_clock_breakdown ......... False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] weight_quantization_config ... None +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] world_size ................... 8 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_allow_untested_optimizer False +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_enabled ................. True +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_force_ds_cpu_optimizer .. True +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_optimization_stage ...... 
3 +(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:991:print_user_config] json = { +(ActorModelRayActor pid=100959) "steps_per_print": 100, +(ActorModelRayActor pid=100959) "zero_optimization": { +(ActorModelRayActor pid=100959) "stage": 3, +(ActorModelRayActor pid=100959) "stage3_prefetch_bucket_size": "auto",  [repeated 4x across cluster] +(ActorModelRayActor pid=100959) "offload_param": { +(ActorModelRayActor pid=100959) "pin_memory": true +(ActorModelRayActor pid=100959) },  [repeated 6x across cluster] +(ActorModelRayActor pid=100959) "bf16": { +(ActorModelRayActor pid=100959) "enabled": true +(ActorModelRayActor pid=100959) "gradient_clipping": 1.0, +(ActorModelRayActor pid=100959) "prescale_gradients": false, +(ActorModelRayActor pid=100959) "wall_clock_breakdown": false, +(ActorModelRayActor pid=100959) "train_micro_batch_size_per_gpu": 2, +(ActorModelRayActor pid=100959) "train_batch_size": 128 +(ActorModelRayActor pid=100959) wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. +(ReferenceModelRayActor pid=101495) +Loading checkpoint shards: 80%|████████ | 4/5 [00:26<00:06, 6.26s/it] [repeated 6x across cluster] +(ReferenceModelRayActor pid=101495) +Loading checkpoint shards: 100%|██████████| 5/5 [00:27<00:00, 4.45s/it] +Loading checkpoint shards: 100%|██████████| 5/5 [00:27<00:00, 5.51s/it] [repeated 7x across cluster] +(ActorModelRayActor pid=100959) wandb: Tracking run with wandb version 0.19.8 +(ActorModelRayActor pid=100959) wandb: W&B syncing is set to `offline` in this directory. +(ActorModelRayActor pid=100959) wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing. +(LLMRayActor pid=89986) init_process_group: master_address=10.140.1.87, master_port=1652, rank=6, world_size=9, group_name=openrlhf +(ActorModelRayActor pid=100959) +Episode [1/2]: 0%| | 0/187 [00:00