nie10 committed
Commit 762892f · verified · 1 parent: b2e70d2

Delete logs

logs/20250528_162012/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 152057
- Train PID: 152058
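
Each process_pids.txt records the two PIDs a launch spawned: the remote reward-model server and the training job. A minimal cleanup sketch, assuming only the two-line "<label> PID: <number>" format shown above (the loop itself is hypothetical, not part of this repo):

    # Hypothetical stop script: extract both PIDs and terminate whichever is still alive.
    for pid in $(awk -F': ' '/PID/ {print $2}' logs/20250528_162012/process_pids.txt); do
        kill -0 "$pid" 2>/dev/null && kill "$pid"
    done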
 
logs/20250528_162012/remote_rm_qa.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ebe8dec7692710158e31a50f8cd02f697d71f55f2faf30af719369115c557ea5
- size 78232454
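
These three-line files are Git LFS pointers, not the logs themselves: the repository stores only the object's sha256 and byte size, and git-lfs resolves the pointer to the real content. A hedged example of recovering such a log from the revision before this deletion (standard git / git-lfs commands; assumes a checkout at the parent commit b2e70d2):

    # Show the pointer as stored at the parent revision.
    git show HEAD~1:logs/20250528_162012/remote_rm_qa.log
    # From a checkout of the parent commit, download the ~78 MB object behind the pointer.
    git lfs pull --include="logs/20250528_162012/remote_rm_qa.log"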
 
logs/20250528_162012/train.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:bf5b4ff1c024d422b2c8485f55e4841aeac34550318b9cc7c96012175757ef6c
- size 18029525
 
logs/20250530_170544/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 272052
- Train PID: 272053
 
logs/20250530_170544/remote_rm_qa.log DELETED
@@ -1,12 +0,0 @@
- [2025-05-30 17:06:16,649] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- Starting reward server with reward type: gaussian_rbf
- RewardCalculator initialized with sigma=0.8, epsilon_bonus=0.3
- Dataset-specific adjustments disabled
- * Serving Flask app 'math_verifier_wolatex_regress_specifc'
- * Debug mode: off
- WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2099
- * Running on http://10.140.0.158:2099
- Press CTRL+C to quit
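
The development-server warning is Flask's standard notice and is harmless for a short-lived experiment endpoint. If the reward server needed to be hardened, a production WSGI server could serve the same app; a hedged sketch, assuming the module exposes its Flask instance under the name "app" (not confirmed by the log):

    # Hypothetical hardened launch: same reward app, served by gunicorn on the same port.
    gunicorn --workers 2 --bind 0.0.0.0:2099 math_verifier_wolatex_regress_specifc:app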
 
logs/20250530_170544/train.log DELETED
@@ -1,211 +0,0 @@
- 2025-05-30 17:06:04,948 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_0a781bcb0b13d679.zip.
- 2025-05-30 17:06:04,949 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
- 2025-05-30 17:06:04,029 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:6025
- 2025-05-30 17:06:09,743 SUCC cli.py:63 -- -------------------------------------------------------
- 2025-05-30 17:06:09,743 SUCC cli.py:64 -- Job 'raysubmit_W5jtiiFQMmWH77cp' submitted successfully
- 2025-05-30 17:06:09,743 SUCC cli.py:65 -- -------------------------------------------------------
- 2025-05-30 17:06:09,743 INFO cli.py:289 -- Next steps
- 2025-05-30 17:06:09,743 INFO cli.py:290 -- Query the logs of the job:
- 2025-05-30 17:06:09,743 INFO cli.py:292 -- ray job logs raysubmit_W5jtiiFQMmWH77cp
- 2025-05-30 17:06:09,743 INFO cli.py:294 -- Query the status of the job:
- 2025-05-30 17:06:09,743 INFO cli.py:296 -- ray job status raysubmit_W5jtiiFQMmWH77cp
- 2025-05-30 17:06:09,743 INFO cli.py:298 -- Request the job to be stopped:
- 2025-05-30 17:06:09,743 INFO cli.py:300 -- ray job stop raysubmit_W5jtiiFQMmWH77cp
- 2025-05-30 17:06:09,747 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
- 2025-05-30 17:06:09,268 INFO job_manager.py:531 -- Runtime env is setting up.
- [2025-05-30 17:06:30,019] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- INFO 05-30 17:06:34 [__init__.py:239] Automatically detected platform cuda.
- 2025-05-30 17:06:35,206 INFO worker.py:1520 -- Using address 10.140.0.158:6033 set in the environment variable RAY_ADDRESS
- 2025-05-30 17:06:35,208 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.0.158:6033...
- 2025-05-30 17:06:35,229 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.140.0.158:6025
- (pid=289229) INFO 05-30 17:06:56 [__init__.py:239] Automatically detected platform cuda.
- (LLMRayActor pid=289229) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
- (LLMRayActor pid=289229) WARNING 05-30 17:07:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
- (LLMRayActor pid=289229) WARNING 05-30 17:07:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
- (LLMRayActor pid=289229) INFO 05-30 17:07:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=43, served_model_name=/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
- (pid=289228) INFO 05-30 17:06:56 [__init__.py:239] Automatically detected platform cuda. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
- (LLMRayActor pid=289226) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
- (LLMRayActor pid=289227) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
- (LLMRayActor pid=289228) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed', 'score'}. Defaulting to 'generate'.
- (LLMRayActor pid=289229) [2025-05-30 17:07:25,107] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- (LLMRayActor pid=289229) INFO 05-30 17:07:29 [cuda.py:293] Using Flash Attention backend.
- (LLMRayActor pid=289228) WARNING 05-30 17:07:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine. [repeated 3x across cluster]
- (LLMRayActor pid=289228) WARNING 05-30 17:07:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled. [repeated 3x across cluster]
- (LLMRayActor pid=289228) INFO 05-30 17:07:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=45, served_model_name=/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  [repeated 3x across cluster]
- (LLMRayActor pid=289227) INFO 05-30 17:07:32 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
- (LLMRayActor pid=289227) INFO 05-30 17:07:32 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/...
- (LLMRayActor pid=289228) [2025-05-30 17:07:25,107] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster]
- (LLMRayActor pid=289227) INFO 05-30 17:07:33 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
- (LLMRayActor pid=289227)
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
- (LLMRayActor pid=289229)
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.28s/it]
- (LLMRayActor pid=289228) INFO 05-30 17:07:29 [cuda.py:293] Using Flash Attention backend. [repeated 3x across cluster]
- (LLMRayActor pid=289226) CUDA Error: out of memory at /mnt/petrelfs/luyiting/MultiAgentEval/vllm/csrc/cumem_allocator.cpp:62
- Traceback (most recent call last):
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
- return _run_code(code, main_globals, None,
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 86, in _run_code
- exec(code, run_globals)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/cli/train_ppo_ray.py", line 486, in <module>
- train(args)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/cli/train_ppo_ray.py", line 86, in train
- vllm_engines = create_vllm_engines(
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 189, in create_vllm_engines
- batch_vllm_engine_call(vllm_engines, "sleep", rank_0_only=False)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 216, in batch_vllm_engine_call
- return ray.get(refs)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
- return fn(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
- return func(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
- values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 931, in get_objects
- raise value
- ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=289226, ip=10.140.0.158, actor_id=00ecb757652fc9ecdd0f10d302000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f2c10149f60>)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
- self.llm = LLM(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
- return fn(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
- self.llm_engine = LLMEngine.from_engine_args(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
- return engine_cls.from_vllm_config(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
- return cls(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 280, in __init__
- self.model_executor = executor_class(vllm_config=vllm_config, )
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 52, in __init__
- self._init_executor()
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
- self.collective_rpc("load_model")
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
- answer = run_method(self.driver_worker, method, args, kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
- return func(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 183, in load_model
- self.model_runner.load_model()
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/model_runner.py", line 1113, in load_model
- self.model = get_model(vllm_config=self.vllm_config)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
- return loader.load_model(vllm_config=vllm_config)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 423, in load_model
- model = _initialize_model(vllm_config=vllm_config)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- return model_class(vllm_config=vllm_config, prefix=prefix)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 802, in __init__
- self.language_model = init_vllm_registered_model(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
- return _initialize_model(vllm_config=vllm_config, prefix=prefix)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- return model_class(vllm_config=vllm_config, prefix=prefix)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 431, in __init__
- self.model = Qwen2Model(vllm_config=vllm_config,
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/compilation/decorators.py", line 151, in __init__
- old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 300, in __init__
- self.start_layer, self.end_layer, self.layers = make_layers(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
- [PPMissingLayer() for _ in range(start_layer)] + [
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
- maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
- lambda prefix: Qwen2DecoderLayer(config=config,
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 206, in __init__
- self.self_attn = Qwen2Attention(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 136, in __init__
- self.qkv_proj = QKVParallelLinear(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 833, in __init__
- super().__init__(input_size=input_size,
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 398, in __init__
- self.quant_method.create_weights(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 178, in create_weights
- weight = Parameter(torch.empty(sum(output_partition_sizes),
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
- return func(*args, **kwargs)
- torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- (LLMRayActor pid=289226) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=289226, ip=10.140.0.158, actor_id=00ecb757652fc9ecdd0f10d302000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f2c10149f60>)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
- (LLMRayActor pid=289226) self.llm = LLM(*args, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
- (LLMRayActor pid=289226) return fn(*args, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
- (LLMRayActor pid=289226) self.llm_engine = LLMEngine.from_engine_args(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
- (LLMRayActor pid=289226) return engine_cls.from_vllm_config(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
- (LLMRayActor pid=289226) return cls(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 280, in __init__
- (LLMRayActor pid=289226) self.model_executor = executor_class(vllm_config=vllm_config, )
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 52, in __init__
- (LLMRayActor pid=289226) self._init_executor()
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
- (LLMRayActor pid=289226) self.collective_rpc("load_model")
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
- (LLMRayActor pid=289226) answer = run_method(self.driver_worker, method, args, kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
- (LLMRayActor pid=289226) return func(*args, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 183, in load_model
- (LLMRayActor pid=289226) self.model_runner.load_model()
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/model_runner.py", line 1113, in load_model
- (LLMRayActor pid=289226) self.model = get_model(vllm_config=self.vllm_config)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
- (LLMRayActor pid=289226) return loader.load_model(vllm_config=vllm_config)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 423, in load_model
- (LLMRayActor pid=289226) model = _initialize_model(vllm_config=vllm_config)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- (LLMRayActor pid=289226) return model_class(vllm_config=vllm_config, prefix=prefix)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 802, in __init__
- (LLMRayActor pid=289226) self.language_model = init_vllm_registered_model(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
- (LLMRayActor pid=289226) return _initialize_model(vllm_config=vllm_config, prefix=prefix)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- (LLMRayActor pid=289226) return model_class(vllm_config=vllm_config, prefix=prefix)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 431, in __init__
- (LLMRayActor pid=289226) self.model = Qwen2Model(vllm_config=vllm_config,
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/compilation/decorators.py", line 151, in __init__
- (LLMRayActor pid=289226) old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 300, in __init__
- (LLMRayActor pid=289226) self.start_layer, self.end_layer, self.layers = make_layers(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
- (LLMRayActor pid=289226) [PPMissingLayer() for _ in range(start_layer)] + [
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
- (LLMRayActor pid=289226) maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
- (LLMRayActor pid=289226) lambda prefix: Qwen2DecoderLayer(config=config,
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 206, in __init__
- (LLMRayActor pid=289226) self.self_attn = Qwen2Attention(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 136, in __init__
- (LLMRayActor pid=289226) self.qkv_proj = QKVParallelLinear(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 833, in __init__
- (LLMRayActor pid=289226) super().__init__(input_size=input_size,
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 398, in __init__
- (LLMRayActor pid=289226) self.quant_method.create_weights(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 178, in create_weights
- (LLMRayActor pid=289226) weight = Parameter(torch.empty(sum(output_partition_sizes),
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
- (LLMRayActor pid=289226) return func(*args, **kwargs)
- (LLMRayActor pid=289226) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/... [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:35 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 3x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.27s/it] [repeated 2x across cluster]
- 2025-05-30 17:07:40,909 ERR cli.py:71 -- ---------------------------------------
- 2025-05-30 17:07:40,909 ERR cli.py:72 -- Job 'raysubmit_W5jtiiFQMmWH77cp' failed
- 2025-05-30 17:07:40,909 ERR cli.py:73 -- ---------------------------------------
- 2025-05-30 17:07:40,909 INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
- (LLMRayActor pid=289226) return func(*args, **kwargs)
- (LLMRayActor pid=289226) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/... [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:35 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 3x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.27s/it] [repeated 2x across cluster]
-
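
This run died while one of the four vLLM engines loaded its weights: GPU 0 was already occupied by a stale process (PID 229250, holding 63.42 GiB of the 79.32 GiB card), so even a 32 MiB allocation failed. A hedged triage sketch along the lines the traceback itself suggests (substitute whatever PID nvidia-smi actually reports; 229250 is the one from this log):

    # List processes holding GPU memory; in this log, PID 229250 is the culprit.
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv
    # Free the GPU (using the reported PID), then resubmit the job.
    kill 229250
    # The OOM message also recommends this allocator setting to limit fragmentation.
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True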
 
logs/20250530_171246/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 311979
- Train PID: 311980
 
logs/20250530_171246/remote_rm_qa.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:24c9b3b226ea3b24ce83505dbca946ffd752efc3dd7d0c7e6420fcc3efac9202
- size 147836970
 
logs/20250530_171246/train.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:fec8e338c58046e39d85642abb654259a62a2bf7a500a3ae347ff7a49a5f71a6
- size 32941037
 
logs/20250602_151514/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 27320
- Train PID: 27321
 
logs/20250602_151514/remote_rm_qa.log DELETED
The diff for this file is too large to render. See raw diff
 
logs/20250602_151514/train.log DELETED
The diff for this file is too large to render. See raw diff
 
logs/20250602_160801/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 120812
- Train PID: 120813
 
logs/20250602_160801/remote_rm_qa.log DELETED
@@ -1,9 +0,0 @@
- [2025-06-02 16:09:41,172] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- Starting reward server with reward type: gaussian_rbf
- RewardCalculator initialized with sigma=0.8, epsilon_bonus=0.3
- Dataset-specific adjustments disabled
- * Serving Flask app 'math_verifier_wolatex_regress_specifc'
- * Debug mode: off
- Address already in use
- Port 2099 is in use by another program. Either identify and stop that program, or start the server with a different port.
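
This relaunch failed because port 2099 was still held, plausibly by the reward server from the 20250602_151514 run started minutes earlier. The usual triage, using standard Linux tools (substitute the PID the lookup reports):

    # Identify the process listening on port 2099.
    ss -ltnp | grep ':2099'    # or: lsof -i :2099
    # Stop it, then restart the reward server (or pick a free port).
    kill <pid>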
 
logs/20250602_160801/train.log DELETED
@@ -1,17 +0,0 @@
- 2025-06-02 16:09:11,176 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_cda7e53a52438a70.zip.
- 2025-06-02 16:09:11,176 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
- 2025-06-02 16:08:58,977 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:6025
- 2025-06-02 16:09:52,257 SUCC cli.py:63 -- -------------------------------------------------------
- 2025-06-02 16:09:52,257 SUCC cli.py:64 -- Job 'raysubmit_zJp8BHzeYdb1T6E9' submitted successfully
- 2025-06-02 16:09:52,257 SUCC cli.py:65 -- -------------------------------------------------------
- 2025-06-02 16:09:52,257 INFO cli.py:289 -- Next steps
- 2025-06-02 16:09:52,257 INFO cli.py:290 -- Query the logs of the job:
- 2025-06-02 16:09:52,257 INFO cli.py:292 -- ray job logs raysubmit_zJp8BHzeYdb1T6E9
- 2025-06-02 16:09:52,257 INFO cli.py:294 -- Query the status of the job:
- 2025-06-02 16:09:52,257 INFO cli.py:296 -- ray job status raysubmit_zJp8BHzeYdb1T6E9
- 2025-06-02 16:09:52,257 INFO cli.py:298 -- Request the job to be stopped:
- 2025-06-02 16:09:52,257 INFO cli.py:300 -- ray job stop raysubmit_zJp8BHzeYdb1T6E9
- 2025-06-02 16:09:52,262 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
- 2025-06-02 16:09:50,572 INFO job_manager.py:531 -- Runtime env is setting up.
- 2025-06-02 16:10:26,125 INFO cli.py:89 -- Status for job 'raysubmit_zJp8BHzeYdb1T6E9': RUNNING
- 2025-06-02 16:10:26,125 INFO cli.py:91 -- Status message: Job is currently running.
 
logs/20250602_161106/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 151580
- Train PID: 151581
 
logs/20250602_161106/remote_rm_qa.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:51b798dda8b23aefeca4bc5901a779c09310625b77939cfd49ebd19781a56448
- size 83905836
 
logs/20250602_161106/train.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:4af1a85d0f95248f6760f4b7e81e02ceec648792c0c9658d9f180f04bbe33550
- size 27665611