Delete logs
- logs/20250528_162012/process_pids.txt +0 -2
- logs/20250528_162012/remote_rm_qa.log +0 -3
- logs/20250528_162012/train.log +0 -3
- logs/20250530_170544/process_pids.txt +0 -2
- logs/20250530_170544/remote_rm_qa.log +0 -12
- logs/20250530_170544/train.log +0 -211
- logs/20250530_171246/process_pids.txt +0 -2
- logs/20250530_171246/remote_rm_qa.log +0 -3
- logs/20250530_171246/train.log +0 -3
- logs/20250602_151514/process_pids.txt +0 -2
- logs/20250602_151514/remote_rm_qa.log +0 -0
- logs/20250602_151514/train.log +0 -0
- logs/20250602_160801/process_pids.txt +0 -2
- logs/20250602_160801/remote_rm_qa.log +0 -9
- logs/20250602_160801/train.log +0 -17
- logs/20250602_161106/process_pids.txt +0 -2
- logs/20250602_161106/remote_rm_qa.log +0 -3
- logs/20250602_161106/train.log +0 -3
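
All eighteen files under `logs/` are deleted in this commit. A hypothetical follow-up (not part of this commit) would untrack the directory and ignore it so future training runs cannot be committed again; a minimal sketch, assuming it is run from the repository root:

```python
# Hypothetical cleanup, not part of this commit: untrack logs/ while
# keeping the files on disk, then ignore the directory going forward.
import pathlib
import subprocess

# Remove logs/ from the git index only (working-tree copies survive).
subprocess.run(["git", "rm", "-r", "--cached", "logs/"], check=True)

# Append an ignore rule once, creating .gitignore if needed.
gitignore = pathlib.Path(".gitignore")
lines = gitignore.read_text().splitlines() if gitignore.exists() else []
if "logs/" not in lines:
    with gitignore.open("a") as f:
        f.write("logs/\n")
```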
logs/20250528_162012/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 152057
-Train PID: 152058
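
Each run directory records the reward-server and trainer PIDs in this two-line format, presumably so a launcher can tear the pair down later. The consuming script is not part of this diff; a sketch of what such a cleanup helper could look like, assuming the format shown above:

```python
# Hypothetical helper, assuming the two-line pid-file format above
# ("Remote RM PID: <pid>" / "Train PID: <pid>"); the actual scripts
# that consume these files are not part of this diff.
import os
import re
import signal

def kill_run(pid_file: str) -> None:
    """Terminate the reward-server and trainer processes of one run."""
    with open(pid_file) as f:
        pids = [int(m.group(1)) for m in re.finditer(r":\s*(\d+)", f.read())]
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # process already exited
```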
logs/20250528_162012/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:ebe8dec7692710158e31a50f8cd02f697d71f55f2faf30af719369115c557ea5
-size 78232454
logs/20250528_162012/train.log
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:bf5b4ff1c024d422b2c8485f55e4841aeac34550318b9cc7c96012175757ef6c
-size 18029525
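
Both log files above were stored through Git LFS, so the tracked content is a three-line pointer stub rather than the log itself: the spec version, the SHA-256 oid of the blob, and its size in bytes (roughly 78 MB and 18 MB here). Deleting the pointer removes the reference from the repository but does not by itself purge the uploaded object from the LFS store. A minimal parser for this pointer layout:

```python
# Minimal parser for the three-field Git LFS pointer layout shown above.
def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],           # LFS spec URL
        "oid": fields["oid"].split(":", 1)[1],  # sha256 hex digest of the blob
        "size": int(fields["size"]),            # blob size in bytes
    }
```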
logs/20250530_170544/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 272052
-Train PID: 272053
logs/20250530_170544/remote_rm_qa.log
DELETED
@@ -1,12 +0,0 @@
-[2025-05-30 17:06:16,649] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-load dataset success
-Starting reward server with reward type: gaussian_rbf
-RewardCalculator initialized with sigma=0.8, epsilon_bonus=0.3
-Dataset-specific adjustments disabled
- * Serving Flask app 'math_verifier_wolatex_regress_specifc'
- * Debug mode: off
-WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2099
- * Running on http://10.140.0.158:2099
-Press CTRL+C to quit
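
The reward server's startup banner names a `gaussian_rbf` reward with sigma=0.8 and epsilon_bonus=0.3. The `RewardCalculator` itself is not in this diff; purely as a speculative reconstruction of what those two logged parameters conventionally mean, a Gaussian RBF reward decays with squared error and can add a small bonus for an exact match:

```python
import math

# Speculative reconstruction: the actual RewardCalculator is not in this
# diff, only its logged parameters (sigma=0.8, epsilon_bonus=0.3).
def gaussian_rbf_reward(pred: float, target: float,
                        sigma: float = 0.8, epsilon_bonus: float = 0.3) -> float:
    reward = math.exp(-((pred - target) ** 2) / (2 * sigma ** 2))
    if pred == target:
        reward += epsilon_bonus  # extra credit for an exact match
    return reward
```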
logs/20250530_170544/train.log
DELETED
@@ -1,211 +0,0 @@
-2025-05-30 17:06:04,948 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_0a781bcb0b13d679.zip.
-2025-05-30 17:06:04,949 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
-2025-05-30 17:06:04,029 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:6025
-2025-05-30 17:06:09,743 SUCC cli.py:63 -- -------------------------------------------------------
-2025-05-30 17:06:09,743 SUCC cli.py:64 -- Job 'raysubmit_W5jtiiFQMmWH77cp' submitted successfully
-2025-05-30 17:06:09,743 SUCC cli.py:65 -- -------------------------------------------------------
-2025-05-30 17:06:09,743 INFO cli.py:289 -- Next steps
-2025-05-30 17:06:09,743 INFO cli.py:290 -- Query the logs of the job:
-2025-05-30 17:06:09,743 INFO cli.py:292 -- ray job logs raysubmit_W5jtiiFQMmWH77cp
-2025-05-30 17:06:09,743 INFO cli.py:294 -- Query the status of the job:
-2025-05-30 17:06:09,743 INFO cli.py:296 -- ray job status raysubmit_W5jtiiFQMmWH77cp
-2025-05-30 17:06:09,743 INFO cli.py:298 -- Request the job to be stopped:
-2025-05-30 17:06:09,743 INFO cli.py:300 -- ray job stop raysubmit_W5jtiiFQMmWH77cp
-2025-05-30 17:06:09,747 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
-2025-05-30 17:06:09,268 INFO job_manager.py:531 -- Runtime env is setting up.
-[2025-05-30 17:06:30,019] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-INFO 05-30 17:06:34 [__init__.py:239] Automatically detected platform cuda.
-2025-05-30 17:06:35,206 INFO worker.py:1520 -- Using address 10.140.0.158:6033 set in the environment variable RAY_ADDRESS
-2025-05-30 17:06:35,208 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.0.158:6033...
-2025-05-30 17:06:35,229 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.140.0.158:6025
-(pid=289229) INFO 05-30 17:06:56 [__init__.py:239] Automatically detected platform cuda.
-(LLMRayActor pid=289229) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
-(LLMRayActor pid=289229) WARNING 05-30 17:07:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
-(LLMRayActor pid=289229) WARNING 05-30 17:07:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
-(LLMRayActor pid=289229) INFO 05-30 17:07:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=43, served_model_name=/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
-(pid=289228) INFO 05-30 17:06:56 [__init__.py:239] Automatically detected platform cuda. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
-(LLMRayActor pid=289226) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
-(LLMRayActor pid=289227) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
-(LLMRayActor pid=289228) INFO 05-30 17:07:22 [config.py:585] This model supports multiple tasks: {'generate', 'classify', 'reward', 'embed', 'score'}. Defaulting to 'generate'.
-(LLMRayActor pid=289229) [2025-05-30 17:07:25,107] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-(LLMRayActor pid=289229) INFO 05-30 17:07:29 [cuda.py:293] Using Flash Attention backend.
-(LLMRayActor pid=289228) WARNING 05-30 17:07:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine. [repeated 3x across cluster]
-(LLMRayActor pid=289228) WARNING 05-30 17:07:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled. [repeated 3x across cluster]
-(LLMRayActor pid=289228) INFO 05-30 17:07:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=45, served_model_name=/mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, [repeated 3x across cluster]
-(LLMRayActor pid=289227) INFO 05-30 17:07:32 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
-(LLMRayActor pid=289227) INFO 05-30 17:07:32 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/...
-(LLMRayActor pid=289228) [2025-05-30 17:07:25,107] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 3x across cluster]
-(LLMRayActor pid=289227) INFO 05-30 17:07:33 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
-(LLMRayActor pid=289227)
-Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
-(LLMRayActor pid=289229)
-Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.28s/it]
-(LLMRayActor pid=289228) INFO 05-30 17:07:29 [cuda.py:293] Using Flash Attention backend. [repeated 3x across cluster]
-(LLMRayActor pid=289226) CUDA Error: out of memory at /mnt/petrelfs/luyiting/MultiAgentEval/vllm/csrc/cumem_allocator.cpp:62
-Traceback (most recent call last):
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
-    return _run_code(code, main_globals, None,
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 86, in _run_code
-    exec(code, run_globals)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/cli/train_ppo_ray.py", line 486, in <module>
-    train(args)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/cli/train_ppo_ray.py", line 86, in train
-    vllm_engines = create_vllm_engines(
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 189, in create_vllm_engines
-    batch_vllm_engine_call(vllm_engines, "sleep", rank_0_only=False)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 216, in batch_vllm_engine_call
-    return ray.get(refs)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
-    return fn(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
-    return func(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
-    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 931, in get_objects
-    raise value
-ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=289226, ip=10.140.0.158, actor_id=00ecb757652fc9ecdd0f10d302000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f2c10149f60>)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
-    self.llm = LLM(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
-    return fn(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
-    self.llm_engine = LLMEngine.from_engine_args(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
-    return engine_cls.from_vllm_config(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
-    return cls(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 280, in __init__
-    self.model_executor = executor_class(vllm_config=vllm_config, )
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 52, in __init__
-    self._init_executor()
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
-    self.collective_rpc("load_model")
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
-    answer = run_method(self.driver_worker, method, args, kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
-    return func(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 183, in load_model
-    self.model_runner.load_model()
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/model_runner.py", line 1113, in load_model
-    self.model = get_model(vllm_config=self.vllm_config)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
-    return loader.load_model(vllm_config=vllm_config)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 423, in load_model
-    model = _initialize_model(vllm_config=vllm_config)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
-    return model_class(vllm_config=vllm_config, prefix=prefix)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 802, in __init__
-    self.language_model = init_vllm_registered_model(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
-    return _initialize_model(vllm_config=vllm_config, prefix=prefix)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
-    return model_class(vllm_config=vllm_config, prefix=prefix)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 431, in __init__
-    self.model = Qwen2Model(vllm_config=vllm_config,
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/compilation/decorators.py", line 151, in __init__
-    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 300, in __init__
-    self.start_layer, self.end_layer, self.layers = make_layers(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
-    [PPMissingLayer() for _ in range(start_layer)] + [
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
-    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
-    lambda prefix: Qwen2DecoderLayer(config=config,
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 206, in __init__
-    self.self_attn = Qwen2Attention(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 136, in __init__
-    self.qkv_proj = QKVParallelLinear(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 833, in __init__
-    super().__init__(input_size=input_size,
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 398, in __init__
-    self.quant_method.create_weights(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 178, in create_weights
-    weight = Parameter(torch.empty(sum(output_partition_sizes),
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
-    return func(*args, **kwargs)
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
-(LLMRayActor pid=289226) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=289226, ip=10.140.0.158, actor_id=00ecb757652fc9ecdd0f10d302000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f2c10149f60>)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-05-30_17-05-52_163961_268377/runtime_resources/working_dir_files/_ray_pkg_0a781bcb0b13d679/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
-(LLMRayActor pid=289226)     self.llm = LLM(*args, **kwargs)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
-(LLMRayActor pid=289226)     return fn(*args, **kwargs)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
-(LLMRayActor pid=289226)     self.llm_engine = LLMEngine.from_engine_args(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
-(LLMRayActor pid=289226)     return engine_cls.from_vllm_config(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
-(LLMRayActor pid=289226)     return cls(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 280, in __init__
-(LLMRayActor pid=289226)     self.model_executor = executor_class(vllm_config=vllm_config, )
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 52, in __init__
-(LLMRayActor pid=289226)     self._init_executor()
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
-(LLMRayActor pid=289226)     self.collective_rpc("load_model")
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
-(LLMRayActor pid=289226)     answer = run_method(self.driver_worker, method, args, kwargs)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
-(LLMRayActor pid=289226)     return func(*args, **kwargs)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 183, in load_model
-(LLMRayActor pid=289226)     self.model_runner.load_model()
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/model_runner.py", line 1113, in load_model
-(LLMRayActor pid=289226)     self.model = get_model(vllm_config=self.vllm_config)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
-(LLMRayActor pid=289226)     return loader.load_model(vllm_config=vllm_config)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 423, in load_model
-(LLMRayActor pid=289226)     model = _initialize_model(vllm_config=vllm_config)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
-(LLMRayActor pid=289226)     return model_class(vllm_config=vllm_config, prefix=prefix)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 802, in __init__
-(LLMRayActor pid=289226)     self.language_model = init_vllm_registered_model(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
-(LLMRayActor pid=289226)     return _initialize_model(vllm_config=vllm_config, prefix=prefix)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/model_loader/loader.py", line 126, in _initialize_model
-(LLMRayActor pid=289226)     return model_class(vllm_config=vllm_config, prefix=prefix)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 431, in __init__
-(LLMRayActor pid=289226)     self.model = Qwen2Model(vllm_config=vllm_config,
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/compilation/decorators.py", line 151, in __init__
-(LLMRayActor pid=289226)     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 300, in __init__
-(LLMRayActor pid=289226)     self.start_layer, self.end_layer, self.layers = make_layers(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
-(LLMRayActor pid=289226)     [PPMissingLayer() for _ in range(start_layer)] + [
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
-(LLMRayActor pid=289226)     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 302, in <lambda>
-(LLMRayActor pid=289226)     lambda prefix: Qwen2DecoderLayer(config=config,
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 206, in __init__
-(LLMRayActor pid=289226)     self.self_attn = Qwen2Attention(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/models/qwen2.py", line 136, in __init__
-(LLMRayActor pid=289226)     self.qkv_proj = QKVParallelLinear(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 833, in __init__
-(LLMRayActor pid=289226)     super().__init__(input_size=input_size,
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 398, in __init__
-(LLMRayActor pid=289226)     self.quant_method.create_weights(
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/model_executor/layers/linear.py", line 178, in create_weights
-(LLMRayActor pid=289226)     weight = Parameter(torch.empty(sum(output_partition_sizes),
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
-(LLMRayActor pid=289226)     return func(*args, **kwargs)
-(LLMRayActor pid=289226) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
-(LLMRayActor pid=289226) INFO 05-30 17:07:34 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 3x across cluster]
-(LLMRayActor pid=289226) INFO 05-30 17:07:34 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/... [repeated 3x across cluster]
-(LLMRayActor pid=289226) INFO 05-30 17:07:35 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 3x across cluster]
-(LLMRayActor pid=289228)
-Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
-(LLMRayActor pid=289228)
-Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.27s/it] [repeated 2x across cluster]
-2025-05-30 17:07:40,909 ERR cli.py:71 -- ---------------------------------------
-2025-05-30 17:07:40,909 ERR cli.py:72 -- Job 'raysubmit_W5jtiiFQMmWH77cp' failed
-2025-05-30 17:07:40,909 ERR cli.py:73 -- ---------------------------------------
-2025-05-30 17:07:40,909 INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
-(LLMRayActor pid=289226)   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
-(LLMRayActor pid=289226)     return func(*args, **kwargs)
-(LLMRayActor pid=289226) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 53.56 MiB is free. Process 229250 has 63.42 GiB memory in use. Including non-PyTorch memory, this process has 15.85 GiB memory in use. Of the allocated memory 14.33 GiB is allocated by PyTorch, with 158.43 MiB allocated in private pools (e.g., CUDA Graphs), and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
-(LLMRayActor pid=289226) INFO 05-30 17:07:34 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 3x across cluster]
-(LLMRayActor pid=289226) INFO 05-30 17:07:34 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/OmniCaptioner/output/output_sft_cot/... [repeated 3x across cluster]
-(LLMRayActor pid=289226) INFO 05-30 17:07:35 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 3x across cluster]
-(LLMRayActor pid=289228)
-Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s] [repeated 2x across cluster]
-(LLMRayActor pid=289228)
-Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.27s/it] [repeated 2x across cluster]
-
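
This run died before training started: GPU 0 had only 53.56 MiB free because process 229250, most likely a leftover from an earlier run (compare the stale PIDs recorded in the process_pids.txt files), was still holding 63.42 GiB when the vLLM engines tried to load the checkpoint. A preflight check along these lines (a sketch, not code from the repo) would fail fast with a clearer message instead:

```python
import torch

# Sketch of a preflight check (not from the repo): fail before vLLM engine
# construction if another process already holds most of the GPU, as
# process 229250 did in the failure above.
def assert_gpu_headroom(device: int = 0, min_free_gib: float = 70.0) -> None:
    free, total = torch.cuda.mem_get_info(device)  # (free_bytes, total_bytes)
    if free / 2**30 < min_free_gib:
        raise RuntimeError(
            f"GPU {device}: only {free / 2**30:.2f} GiB free of "
            f"{total / 2**30:.2f} GiB; a stale process may still hold memory."
        )
```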
logs/20250530_171246/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 311979
-Train PID: 311980
logs/20250530_171246/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:24c9b3b226ea3b24ce83505dbca946ffd752efc3dd7d0c7e6420fcc3efac9202
-size 147836970
logs/20250530_171246/train.log
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:fec8e338c58046e39d85642abb654259a62a2bf7a500a3ae347ff7a49a5f71a6
-size 32941037
logs/20250602_151514/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 27320
-Train PID: 27321
logs/20250602_151514/remote_rm_qa.log
DELETED
The diff for this file is too large to render.
logs/20250602_151514/train.log
DELETED
The diff for this file is too large to render.
logs/20250602_160801/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 120812
-Train PID: 120813
logs/20250602_160801/remote_rm_qa.log
DELETED
@@ -1,9 +0,0 @@
-[2025-06-02 16:09:41,172] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-load dataset success
-Starting reward server with reward type: gaussian_rbf
-RewardCalculator initialized with sigma=0.8, epsilon_bonus=0.3
-Dataset-specific adjustments disabled
- * Serving Flask app 'math_verifier_wolatex_regress_specifc'
- * Debug mode: off
-Address already in use
-Port 2099 is in use by another program. Either identify and stop that program, or start the server with a different port.
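
This reward server died immediately because the previous instance was still bound to port 2099. A bind probe before launch (a sketch, assuming the fixed port 2099 from the launcher) turns this into an explicit failure:

```python
import socket

# Sketch, not from the repo: probe the reward-server port before starting
# Flask so a stale instance is reported up front. Best-effort only; a
# socket in TIME_WAIT can also make the bind fail.
def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True

if not port_is_free(2099):
    raise SystemExit("Port 2099 is still in use; stop the old reward server first.")
```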
logs/20250602_160801/train.log
DELETED
@@ -1,17 +0,0 @@
-2025-06-02 16:09:11,176 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_cda7e53a52438a70.zip.
-2025-06-02 16:09:11,176 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
-2025-06-02 16:08:58,977 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:6025
-2025-06-02 16:09:52,257 SUCC cli.py:63 -- -------------------------------------------------------
-2025-06-02 16:09:52,257 SUCC cli.py:64 -- Job 'raysubmit_zJp8BHzeYdb1T6E9' submitted successfully
-2025-06-02 16:09:52,257 SUCC cli.py:65 -- -------------------------------------------------------
-2025-06-02 16:09:52,257 INFO cli.py:289 -- Next steps
-2025-06-02 16:09:52,257 INFO cli.py:290 -- Query the logs of the job:
-2025-06-02 16:09:52,257 INFO cli.py:292 -- ray job logs raysubmit_zJp8BHzeYdb1T6E9
-2025-06-02 16:09:52,257 INFO cli.py:294 -- Query the status of the job:
-2025-06-02 16:09:52,257 INFO cli.py:296 -- ray job status raysubmit_zJp8BHzeYdb1T6E9
-2025-06-02 16:09:52,257 INFO cli.py:298 -- Request the job to be stopped:
-2025-06-02 16:09:52,257 INFO cli.py:300 -- ray job stop raysubmit_zJp8BHzeYdb1T6E9
-2025-06-02 16:09:52,262 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
-2025-06-02 16:09:50,572 INFO job_manager.py:531 -- Runtime env is setting up.
-2025-06-02 16:10:26,125 INFO cli.py:89 -- Status for job 'raysubmit_zJp8BHzeYdb1T6E9': RUNNING
-2025-06-02 16:10:26,125 INFO cli.py:91 -- Status message: Job is currently running.
logs/20250602_161106/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 151580
-Train PID: 151581
logs/20250602_161106/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:51b798dda8b23aefeca4bc5901a779c09310625b77939cfd49ebd19781a56448
-size 83905836
logs/20250602_161106/train.log
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:4af1a85d0f95248f6760f4b7e81e02ceec648792c0c9658d9f180f04bbe33550
-size 27665611