nie10 committed
Commit 762892f · verified · 1 parent: b2e70d2

Delete logs

logs/20250528_162012/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 152057
- Train PID: 152058
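
Each process_pids.txt records the two PIDs a launch spawned: the remote reward-model server and the training job. A minimal cleanup sketch, assuming only the two-line "<label> PID: <number>" format shown above (the loop itself is hypothetical, not part of this repo):

    # Hypothetical stop script: extract both PIDs and terminate whichever is still alive.
    for pid in $(awk -F': ' '/PID/ {print $2}' logs/20250528_162012/process_pids.txt); do
        kill -0 "$pid" 2>/dev/null && kill "$pid"
    done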
 
logs/20250528_162012/remote_rm_qa.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ebe8dec7692710158e31a50f8cd02f697d71f55f2faf30af719369115c557ea5
- size 78232454
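
These three-line files are Git LFS pointers, not the logs themselves: the repository stores only the object's sha256 and byte size, and git-lfs resolves the pointer to the real content. A hedged example of recovering such a log from the revision before this deletion (standard git / git-lfs commands; assumes a checkout at the parent commit b2e70d2):

    # Show the pointer as stored at the parent revision.
    git show HEAD~1:logs/20250528_162012/remote_rm_qa.log
    # From a checkout of the parent commit, download the ~78 MB object behind the pointer.
    git lfs pull --include="logs/20250528_162012/remote_rm_qa.log"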
 
logs/20250528_162012/train.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:bf5b4ff1c024d422b2c8485f55e4841aeac34550318b9cc7c96012175757ef6c
- size 18029525
 
logs/20250530_170544/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 272052
- Train PID: 272053
 
logs/20250530_170544/remote_rm_qa.log DELETED
@@ -1,12 +0,0 @@
- [2025-05-30 17:06:16,649] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- Starting reward server with reward type: gaussian_rbf
- RewardCalculator initialized with sigma=0.8, epsilon_bonus=0.3
- Dataset-specific adjustments disabled
- * Serving Flask app 'math_verifier_wolatex_regress_specifc'
- * Debug mode: off
- WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2099
- * Running on http://10.140.0.158:2099
- Press CTRL+C to quit
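
The development-server warning is Flask's standard notice and is harmless for a short-lived experiment endpoint. If the reward server needed to be hardened, a production WSGI server could serve the same app; a hedged sketch, assuming the module exposes its Flask instance under the name "app" (not confirmed by the log):

    # Hypothetical hardened launch: same reward app, served by gunicorn on the same port.
    gunicorn --workers 2 --bind 0.0.0.0:2099 math_verifier_wolatex_regress_specifc:app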
 
logs/20250530_170544/train.log DELETED
@@ -1,211 +0,0 @@
- 2025-05-30 17:06:04,948 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_0a781bcb0b13d679.zip.
- 2025-05-30 17:06:04,949 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
- 2025-05-30 17:06:04,029 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:6025
- 2025-05-30 17:06:09,743 SUCC cli.py:63 -- -------------------------------------------------------
- 2025-05-30 17:06:09,743 SUCC cli.py:64 -- Job 'raysubmit_W5jtiiFQMmWH77cp' submitted successfully
- 2025-05-30 17:06:09,743 SUCC cli.py:65 -- -------------------------------------------------------
- 2025-05-30 17:06:09,743 INFO cli.py:289 -- Next steps
- 2025-05-30 17:06:09,743 INFO cli.py:290 -- Query the logs of the job:
- 2025-05-30 17:06:09,743 INFO cli.py:292 -- ray job logs raysubmit_W5jtiiFQMmWH77cp
- 2025-05-30 17:06:09,743 INFO cli.py:294 -- Query the status of the job:
- 2025-05-30 17:06:09,743 INFO cli.py:296 -- ray job status raysubmit_W5jtiiFQMmWH77cp
- 2025-05-30 17:06:09,743 INFO cli.py:298 -- Request the job to be stopped:
- 2025-05-30 17:06:09,743 INFO cli.py:300 -- ray job stop raysubmit_W5jtiiFQMmWH77cp
- 2025-05-30 17:06:09,747 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
- 2025-05-30 17:06:09,268 INFO job_manager.py:531 -- Runtime env is setting up.
- [2025-05-30 17:06:30,019] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- INFO 05-30 17:06:34 [__init__.py:239] Automatically detected platform cuda.
- 2025-05-30 17:06:35,206 INFO worker.py:1520 -- Using address 10.140.0.158:6033 set in the environment variable RAY_ADDRESS
- 2025-05-30 17:06:35,208 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.0.158:6033...
- 2025-05-30 17:06:35,229 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.140.0.158:6025
- (pid=289229) INFO 05-30 17:06:56 [__init__.py:239] Automatically detected platform cuda.
- (LLMRayActor pid=289229) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
- (LLMRayActor pid=289229) WARNING 05-30 17:07:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
- (LLMRayActor pid=289229) WARNING 05-30 17:07:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
- (LLMRayActor pid=289229) INFO 05-30 17:07:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=43, served_model_name=/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
- (pid=289228) INFO 05-30 17:06:56 [__init__.py:239] Automatically detected platform cuda. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
- (LLMRayActor pid=289226) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
- (LLMRayActor pid=289227) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
- (LLMRayActor pid=289228) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed', 'score'}. Defaulting to 'generate'.
- (LLMRayActor pid=289229) [2025-05-30 17:07:25,107] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- (LLMRayActor pid=289229) INFO 05-30 17:07:29 [cuda.py:293] Using Flash Attention backend.
- (LLMRayActor pid=289228) WARNING 05-30 17:07:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine. [repeated 3x across cluster]
- (LLMRayActor pid=289228) WARNING 05-30 17:07:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled. [repeated 3x across cluster]
- (LLMRayActor pid=289228) INFO 05-30 17:07:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=45, served_model_name=/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,  [repeated 3x across cluster]
- (LLMRayActor pid=289227) INFO 05-30 17:07:32 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
- (LLMRayActor pid=289227) INFO 05-30 17:07:32 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/...
- (LLMRayActor pid=289228) [2025-05-30 17:07:25,107] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster]
- (LLMRayActor pid=289227) INFO 05-30 17:07:33 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
- (LLMRayActor pid=289227)
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
- (LLMRayActor pid=289229)
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.28s/it]
- (LLMRayActor pid=289228) INFO 05-30 17:07:29 [cuda.py:293] Using Flash Attention backend. [repeated 3x across cluster]
- (LLMRayActor pid=289226) CUDA Error: out of memory at /mnt/petrelfs/luyiting/MultiAgentEval/vllm/csrc/cumem_allocator.cpp:62
- Traceback (most recent call last):
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
- return _run_code(code, main_globals, None,
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 86, in _run_code
- exec(code, run_globals)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/cli/train_ppo_ray.py", line 486, in <module>
- train(args)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/cli/train_ppo_ray.py", line 86, in train
- vllm_engines = create_vllm_engines(
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 189, in create_vllm_engines
- batch_vllm_engine_call(vllm_engines, "sleep", rank_0_only=False)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 216, in batch_vllm_engine_call
- return ray.get(refs)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
- return fn(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
- return func(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
- values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 931, in get_objects
- raise value
- ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=289226, ip=10.140.0.158, actor_id=00ecb757652fc9ecdd0f10d302000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f2c10149f60>)
- File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
- self.llm = LLM(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
- return fn(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
- self.llm_engine = LLMEngine.from_engine_args(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
- return engine_cls.from_vllm_config(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
- return cls(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 280, in __init__
- self.model_executor = executor_class(vllm_config=vllm_config, )
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 52, in __init__
- self._init_executor()
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
- self.collective_rpc("load_model")
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
- answer = run_method(self.driver_worker, method, args, kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
- return func(*args, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 183, in load_model
- self.model_runner.load_model()
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/model_runner.py", line 1113, in load_model
- self.model = get_model(vllm_config=self.vllm_config)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
- return loader.load_model(vllm_config=vllm_config)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 423, in load_model
- model = _initialize_model(vllm_config=vllm_config)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- return model_class(vllm_config=vllm_config, prefix=prefix)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 802, in __init__
- self.language_model = init_vllm_registered_model(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
- return _initialize_model(vllm_config=vllm_config, prefix=prefix)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- return model_class(vllm_config=vllm_config, prefix=prefix)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 431, in __init__
- self.model = Qwen2Model(vllm_config=vllm_config,
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/compilation/decorators.py", line 151, in __init__
- old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 300, in __init__
- self.start_layer, self.end_layer, self.layers = make_layers(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
- [PPMissingLayer() for _ in range(start_layer)] + [
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
- maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
- lambda prefix: Qwen2DecoderLayer(config=config,
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 206, in __init__
- self.self_attn = Qwen2Attention(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 136, in __init__
- self.qkv_proj = QKVParallelLinear(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 833, in __init__
- super().__init__(input_size=input_size,
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 398, in __init__
- self.quant_method.create_weights(
- File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 178, in create_weights
- weight = Parameter(torch.empty(sum(output_partition_sizes),
- File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
- return func(*args, **kwargs)
- torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- (LLMRayActor pid=289226) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=289226, ip=10.140.0.158, actor_id=00ecb757652fc9ecdd0f10d302000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f2c10149f60>)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
- (LLMRayActor pid=289226) self.llm = LLM(*args, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
- (LLMRayActor pid=289226) return fn(*args, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
- (LLMRayActor pid=289226) self.llm_engine = LLMEngine.from_engine_args(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
- (LLMRayActor pid=289226) return engine_cls.from_vllm_config(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
- (LLMRayActor pid=289226) return cls(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 280, in __init__
- (LLMRayActor pid=289226) self.model_executor = executor_class(vllm_config=vllm_config, )
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 52, in __init__
- (LLMRayActor pid=289226) self._init_executor()
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
- (LLMRayActor pid=289226) self.collective_rpc("load_model")
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
- (LLMRayActor pid=289226) answer = run_method(self.driver_worker, method, args, kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
- (LLMRayActor pid=289226) return func(*args, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 183, in load_model
- (LLMRayActor pid=289226) self.model_runner.load_model()
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/model_runner.py", line 1113, in load_model
- (LLMRayActor pid=289226) self.model = get_model(vllm_config=self.vllm_config)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
- (LLMRayActor pid=289226) return loader.load_model(vllm_config=vllm_config)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 423, in load_model
- (LLMRayActor pid=289226) model = _initialize_model(vllm_config=vllm_config)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- (LLMRayActor pid=289226) return model_class(vllm_config=vllm_config, prefix=prefix)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 802, in __init__
- (LLMRayActor pid=289226) self.language_model = init_vllm_registered_model(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
- (LLMRayActor pid=289226) return _initialize_model(vllm_config=vllm_config, prefix=prefix)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
- (LLMRayActor pid=289226) return model_class(vllm_config=vllm_config, prefix=prefix)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 431, in __init__
- (LLMRayActor pid=289226) self.model = Qwen2Model(vllm_config=vllm_config,
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/compilation/decorators.py", line 151, in __init__
- (LLMRayActor pid=289226) old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 300, in __init__
- (LLMRayActor pid=289226) self.start_layer, self.end_layer, self.layers = make_layers(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
- (LLMRayActor pid=289226) [PPMissingLayer() for _ in range(start_layer)] + [
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
- (LLMRayActor pid=289226) maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
- (LLMRayActor pid=289226) lambda prefix: Qwen2DecoderLayer(config=config,
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 206, in __init__
- (LLMRayActor pid=289226) self.self_attn = Qwen2Attention(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 136, in __init__
- (LLMRayActor pid=289226) self.qkv_proj = QKVParallelLinear(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 833, in __init__
- (LLMRayActor pid=289226) super().__init__(input_size=input_size,
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 398, in __init__
- (LLMRayActor pid=289226) self.quant_method.create_weights(
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 178, in create_weights
- (LLMRayActor pid=289226) weight = Parameter(torch.empty(sum(output_partition_sizes),
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
- (LLMRayActor pid=289226) return func(*args, **kwargs)
- (LLMRayActor pid=289226) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/... [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:35 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 3x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.27s/it] [repeated 2x across cluster]
- 2025-05-30 17:07:40,909 ERR cli.py:71 -- ---------------------------------------
- 2025-05-30 17:07:40,909 ERR cli.py:72 -- Job 'raysubmit_W5jtiiFQMmWH77cp' failed
- 2025-05-30 17:07:40,909 ERR cli.py:73 -- ---------------------------------------
- 2025-05-30 17:07:40,909 INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
- (LLMRayActor pid=289226) File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
- (LLMRayActor pid=289226) return func(*args, **kwargs)
- (LLMRayActor pid=289226) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:34 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/... [repeated 3x across cluster]
- (LLMRayActor pid=289226) INFO 05-30 17:07:35 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 3x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
- (LLMRayActor pid=289228)
- Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.27s/it] [repeated 2x across cluster]
-
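
This run died while one of the four vLLM engines loaded its weights: GPU 0 was already occupied by a stale process (PID 229250, holding 63.42 GiB of the 79.32 GiB card), so even a 32 MiB allocation failed. A hedged triage sketch along the lines the traceback itself suggests (substitute whatever PID nvidia-smi actually reports; 229250 is the one from this log):

    # List processes holding GPU memory; in this log, PID 229250 is the culprit.
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv
    # Free the GPU (using the reported PID), then resubmit the job.
    kill 229250
    # The OOM message also recommends this allocator setting to limit fragmentation.
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True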
 
logs/20250530_171246/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 311979
- Train PID: 311980
 
logs/20250530_171246/remote_rm_qa.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:24c9b3b226ea3b24ce83505dbca946ffd752efc3dd7d0c7e6420fcc3efac9202
- size 147836970
 
logs/20250530_171246/train.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:fec8e338c58046e39d85642abb654259a62a2bf7a500a3ae347ff7a49a5f71a6
- size 32941037
 
logs/20250602_151514/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 27320
- Train PID: 27321
 
logs/20250602_151514/remote_rm_qa.log DELETED
The diff for this file is too large to render. See raw diff
 
logs/20250602_151514/train.log DELETED
The diff for this file is too large to render. See raw diff
 
logs/20250602_160801/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 120812
- Train PID: 120813
 
logs/20250602_160801/remote_rm_qa.log DELETED
@@ -1,9 +0,0 @@
- [2025-06-02 16:09:41,172] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- Starting reward server with reward type: gaussian_rbf
- RewardCalculator initialized with sigma=0.8, epsilon_bonus=0.3
- Dataset-specific adjustments disabled
- * Serving Flask app 'math_verifier_wolatex_regress_specifc'
- * Debug mode: off
- Address already in use
- Port 2099 is in use by another program. Either identify and stop that program, or start the server with a different port.
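
This relaunch failed because port 2099 was still held, plausibly by the reward server from the 20250602_151514 run started minutes earlier. The usual triage, using standard Linux tools (substitute the PID the lookup reports):

    # Identify the process listening on port 2099.
    ss -ltnp | grep ':2099'    # or: lsof -i :2099
    # Stop it, then restart the reward server (or pick a free port).
    kill <pid>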
 
logs/20250602_160801/train.log DELETED
@@ -1,17 +0,0 @@
- 2025-06-02 16:09:11,176 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_cda7e53a52438a70.zip.
- 2025-06-02 16:09:11,176 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
- 2025-06-02 16:08:58,977 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:6025
- 2025-06-02 16:09:52,257 SUCC cli.py:63 -- -------------------------------------------------------
- 2025-06-02 16:09:52,257 SUCC cli.py:64 -- Job 'raysubmit_zJp8BHzeYdb1T6E9' submitted successfully
- 2025-06-02 16:09:52,257 SUCC cli.py:65 -- -------------------------------------------------------
- 2025-06-02 16:09:52,257 INFO cli.py:289 -- Next steps
- 2025-06-02 16:09:52,257 INFO cli.py:290 -- Query the logs of the job:
- 2025-06-02 16:09:52,257 INFO cli.py:292 -- ray job logs raysubmit_zJp8BHzeYdb1T6E9
- 2025-06-02 16:09:52,257 INFO cli.py:294 -- Query the status of the job:
- 2025-06-02 16:09:52,257 INFO cli.py:296 -- ray job status raysubmit_zJp8BHzeYdb1T6E9
- 2025-06-02 16:09:52,257 INFO cli.py:298 -- Request the job to be stopped:
- 2025-06-02 16:09:52,257 INFO cli.py:300 -- ray job stop raysubmit_zJp8BHzeYdb1T6E9
- 2025-06-02 16:09:52,262 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
- 2025-06-02 16:09:50,572 INFO job_manager.py:531 -- Runtime env is setting up.
- 2025-06-02 16:10:26,125 INFO cli.py:89 -- Status for job 'raysubmit_zJp8BHzeYdb1T6E9': RUNNING
- 2025-06-02 16:10:26,125 INFO cli.py:91 -- Status message: Job is currently running.
 
logs/20250602_161106/process_pids.txt DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 151580
- Train PID: 151581
 
logs/20250602_161106/remote_rm_qa.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:51b798dda8b23aefeca4bc5901a779c09310625b77939cfd49ebd19781a56448
- size 83905836
 
logs/20250602_161106/train.log DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:4af1a85d0f95248f6760f4b7e81e02ceec648792c0c9658d9f180f04bbe33550
- size 27665611