Delete logs
This view is limited to 50 files because it contains too many changes.
- logs/20250526_182357/process_pids.txt +0 -2
- logs/20250526_182357/remote_rm_qa.log +0 -0
- logs/20250526_182357/train.log +0 -0
- logs/20250526_185638/process_pids.txt +0 -2
- logs/20250526_185638/remote_rm_qa.log +0 -0
- logs/20250526_185638/train.log +0 -0
- logs/20250526_191827/process_pids.txt +0 -2
- logs/20250526_191827/remote_rm_qa.log +0 -0
- logs/20250526_191827/train.log +0 -0
- logs/20250526_193312/process_pids.txt +0 -2
- logs/20250526_193312/remote_rm_qa.log +0 -9
- logs/20250526_193312/train.log +0 -2
- logs/20250526_193656/process_pids.txt +0 -2
- logs/20250526_193656/remote_rm_qa.log +0 -9
- logs/20250526_193656/train.log +0 -77
- logs/20250526_194456/process_pids.txt +0 -2
- logs/20250526_194456/remote_rm_qa.log +0 -0
- logs/20250526_194456/train.log +0 -92
- logs/20250527_011343/process_pids.txt +0 -2
- logs/20250527_011343/remote_rm_qa.log +0 -9
- logs/20250527_011343/train.log +0 -45
- logs/20250527_095510/process_pids.txt +0 -2
- logs/20250527_095510/remote_rm_qa.log +0 -3
- logs/20250527_095510/train.log +0 -0
- logs/20250527_235509/process_pids.txt +0 -2
- logs/20250527_235509/remote_rm_qa.log +0 -0
- logs/20250527_235509/train.log +0 -0
- logs/20250528_110535/process_pids.txt +0 -2
- logs/20250528_110535/remote_rm_qa.log +0 -3
- logs/20250528_110535/train.log +0 -0
- logs/20250528_161139/process_pids.txt +0 -2
- logs/20250528_161139/remote_rm_qa.log +0 -3
- logs/20250528_161139/train.log +0 -0
- logs/20250529_214257/process_pids.txt +0 -2
- logs/20250529_214257/remote_rm_qa.log +0 -3
- logs/20250529_214257/train.log +0 -0
- logs/20250611_110725/process_pids.txt +0 -2
- logs/20250611_110725/remote_rm_qa.log +0 -3
- logs/20250611_110725/train.log +0 -0
- logs/20250611_150946/process_pids.txt +0 -2
- logs/20250611_150946/remote_rm_qa.log +0 -0
- logs/20250611_150946/train.log +0 -0
- logs/20250611_160325/process_pids.txt +0 -2
- logs/20250611_160325/remote_rm_qa.log +0 -9
- logs/20250611_160325/train.log +0 -0
- logs/20250611_161239/process_pids.txt +0 -2
- logs/20250611_161239/remote_rm_qa.log +0 -9
- logs/20250611_161239/train.log +0 -188
- logs/20250611_162203/process_pids.txt +0 -2
- logs/20250611_162203/remote_rm_qa.log +0 -0
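Every file below is a timestamped training-run log directory being removed from version control; only the first 50 deletions are rendered here. As an editorial aside, a cleanup commit like this one is easy to script and to make permanent. The sketch below is a minimal illustration only, written against the directory layout visible in this diff; the .gitignore step is an assumption, not something this commit shows.

import subprocess
from pathlib import Path

# Untrack every timestamped run directory under logs/ (names like
# 20250526_182357). "git rm -r" stages the deletions seen in this diff.
for run_dir in sorted(Path("logs").glob("20??????_??????")):
    subprocess.run(["git", "rm", "-r", "-q", str(run_dir)], check=True)

# Assumed follow-up: ignore logs/ so future runs stay out of the repo.
with open(".gitignore", "a", encoding="utf-8") as f:
    f.write("logs/\n")

subprocess.run(["git", "add", ".gitignore"], check=True)
subprocess.run(["git", "commit", "-m", "Delete logs"], check=True)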
logs/20250526_182357/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 126386
- Train PID: 126387

logs/20250526_182357/remote_rm_qa.log
DELETED
The diff for this file is too large to render.

logs/20250526_182357/train.log
DELETED
The diff for this file is too large to render.
logs/20250526_185638/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 435069
- Train PID: 435070

logs/20250526_185638/remote_rm_qa.log
DELETED
The diff for this file is too large to render.

logs/20250526_185638/train.log
DELETED
The diff for this file is too large to render.
logs/20250526_191827/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 74501
- Train PID: 74502

logs/20250526_191827/remote_rm_qa.log
DELETED
The diff for this file is too large to render.

logs/20250526_191827/train.log
DELETED
The diff for this file is too large to render.
logs/20250526_193312/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 279066
- Train PID: 279067

logs/20250526_193312/remote_rm_qa.log
DELETED
@@ -1,9 +0,0 @@
- [2025-05-26 19:33:56,196] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- * Serving Flask app 'math_verifier_wolatex'
- * Debug mode: off
- WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2394
- * Running on http://10.140.0.144:2394
- Press CTRL+C to quit

logs/20250526_193312/train.log
DELETED
@@ -1,2 +0,0 @@
- 2025-05-26 19:33:39,938 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_321e0871e56ca1df.zip.
- 2025-05-26 19:33:39,938 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
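Editorial note on the log just above: the reward-model verifier is served with Flask's built-in development server, which is what triggers the WARNING line. A production WSGI server is the usual fix; the sketch below is a hedged illustration only, assuming the Flask app object is importable as app from the math_verifier_wolatex module named in the log (the module/attribute layout is a guess, not visible in this diff).

# pip install waitress  (a production WSGI server)
from waitress import serve

# Assumed import: the log only shows the app name 'math_verifier_wolatex';
# where the app object actually lives is not shown anywhere in this diff.
from math_verifier_wolatex import app

# Same bind address and port as the development server in the log above.
serve(app, host="0.0.0.0", port=2394)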
logs/20250526_193656/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 29110
- Train PID: 29111

logs/20250526_193656/remote_rm_qa.log
DELETED
@@ -1,9 +0,0 @@
- [2025-05-26 19:37:40,391] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- * Serving Flask app 'math_verifier_wolatex'
- * Debug mode: off
- WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2394
- * Running on http://10.140.0.167:2394
- Press CTRL+C to quit

logs/20250526_193656/train.log
DELETED
@@ -1,77 +0,0 @@
- 2025-05-26 19:37:24,046 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_321e0871e56ca1df.zip.
- 2025-05-26 19:37:24,047 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
- 2025-05-26 19:37:22,460 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:2983
- 2025-05-26 19:37:32,118 SUCC cli.py:63 -- -------------------------------------------------------
- 2025-05-26 19:37:32,118 SUCC cli.py:64 -- Job 'raysubmit_endd6V8YzvkhTPfY' submitted successfully
- 2025-05-26 19:37:32,118 SUCC cli.py:65 -- -------------------------------------------------------
- 2025-05-26 19:37:32,118 INFO cli.py:289 -- Next steps
- 2025-05-26 19:37:32,118 INFO cli.py:290 -- Query the logs of the job:
- 2025-05-26 19:37:32,118 INFO cli.py:292 -- ray job logs raysubmit_endd6V8YzvkhTPfY
- 2025-05-26 19:37:32,118 INFO cli.py:294 -- Query the status of the job:
- 2025-05-26 19:37:32,118 INFO cli.py:296 -- ray job status raysubmit_endd6V8YzvkhTPfY
- 2025-05-26 19:37:32,118 INFO cli.py:298 -- Request the job to be stopped:
- 2025-05-26 19:37:32,119 INFO cli.py:300 -- ray job stop raysubmit_endd6V8YzvkhTPfY
- 2025-05-26 19:37:32,121 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
- 2025-05-26 19:37:31,103 INFO job_manager.py:531 -- Runtime env is setting up.
- [2025-05-26 19:38:02,682] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- INFO 05-26 19:38:09 [__init__.py:239] Automatically detected platform cuda.
- 2025-05-26 19:38:11,252 INFO worker.py:1520 -- Using address 10.140.0.167:6231 set in the environment variable RAY_ADDRESS
- 2025-05-26 19:38:11,253 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.0.167:6231...
- 2025-05-26 19:38:11,274 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.140.0.167:2983
- (pid=42922) INFO 05-26 19:38:44 [__init__.py:239] Automatically detected platform cuda.
- (LLMRayActor pid=42923) INFO 05-26 19:39:22 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
- (LLMRayActor pid=42923) WARNING 05-26 19:39:22 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
- (LLMRayActor pid=42923) WARNING 05-26 19:39:22 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
- (LLMRayActor pid=42923) INFO 05-26 19:39:22 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=43, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
- (pid=42927) INFO 05-26 19:38:44 [__init__.py:239] Automatically detected platform cuda. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
- (LLMRayActor pid=42928) INFO 05-26 19:39:22 [config.py:585] This model supports multiple tasks: {'generate', 'reward', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
- (LLMRayActor pid=42921) INFO 05-26 19:39:22 [config.py:585] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
- (LLMRayActor pid=42927) INFO 05-26 19:39:22 [config.py:585] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
- (LLMRayActor pid=42924) INFO 05-26 19:39:23 [config.py:585] This model supports multiple tasks: {'score', 'embed', 'generate', 'reward', 'classify'}. Defaulting to 'generate'.
- (LLMRayActor pid=42922) INFO 05-26 19:39:23 [config.py:585] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
- (LLMRayActor pid=42925) INFO 05-26 19:39:23 [config.py:585] This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
- (LLMRayActor pid=42926) INFO 05-26 19:39:23 [config.py:585] This model supports multiple tasks: {'reward', 'score', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
- (LLMRayActor pid=42923) [2025-05-26 19:39:26,028] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- (LLMRayActor pid=42923) INFO 05-26 19:39:33 [cuda.py:293] Using Flash Attention backend.
- (LLMRayActor pid=42925) WARNING 05-26 19:39:23 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine. [repeated 7x across cluster]
- (LLMRayActor pid=42925) WARNING 05-26 19:39:23 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled. [repeated 7x across cluster]
- (LLMRayActor pid=42925) INFO 05-26 19:39:23 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=44, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, [repeated 7x across cluster]
- (LLMRayActor pid=42926) [2025-05-26 19:39:26,784] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 7x across cluster]
- (LLMRayActor pid=42923) INFO 05-26 19:39:36 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
- (LLMRayActor pid=42924) INFO 05-26 19:39:36 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/...
- (LLMRayActor pid=42922) INFO 05-26 19:39:37 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
- (LLMRayActor pid=42924)
- Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
- (LLMRayActor pid=42924)
- Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.07it/s]
- (LLMRayActor pid=42927)
- Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s] [repeated 7x across cluster]
- (LLMRayActor pid=42924)
- Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:08<00:06, 3.28s/it] [repeated 16x across cluster]
- (LLMRayActor pid=42924) INFO 05-26 19:39:54 [loader.py:429] Loading weights took 16.67 seconds
- (LLMRayActor pid=42926) INFO 05-26 19:39:33 [cuda.py:293] Using Flash Attention backend. [repeated 7x across cluster]
- (LLMRayActor pid=42926) INFO 05-26 19:39:37 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0 [repeated 7x across cluster]
- (LLMRayActor pid=42926) INFO 05-26 19:39:37 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/... [repeated 7x across cluster]
- (LLMRayActor pid=42926) INFO 05-26 19:39:37 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] [repeated 7x across cluster]
- (LLMRayActor pid=42924)
- (LLMRayActor pid=42924)
- Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:16<00:00, 3.30s/it] [repeated 17x across cluster]
- (LLMRayActor pid=42925)
- (LLMRayActor pid=42922)
- (LLMRayActor pid=42923)
- (LLMRayActor pid=42926)
- (LLMRayActor pid=42921)
- (LLMRayActor pid=42928)
- (LLMRayActor pid=42927)
- (LLMRayActor pid=42924) INFO 05-26 19:39:54 [model_runner.py:1146] Model loading took 15.6271 GB and 17.157056 seconds
- (LLMRayActor pid=42924) WARNING 05-26 19:39:55 [model_runner.py:1296] Computed max_num_seqs (min(256, 8192 // 32768)) to be less than 1. Setting it to the minimum value of 1.
- (LLMRayActor pid=42924) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
- (LLMRayActor pid=42927) WARNING 05-26 19:40:00 [profiling.py:222] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 8192) is too short to hold the multi-modal embeddings in the worst case (32768 tokens in total, out of which {'image': 16384, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
- (LLMRayActor pid=42927) INFO 05-26 19:39:54 [loader.py:429] Loading weights took 16.66 seconds [repeated 7x across cluster]
- (LLMRayActor pid=42928) INFO 05-26 19:39:54 [model_runner.py:1146] Model loading took 15.6271 GB and 17.248244 seconds [repeated 7x across cluster]
- (LLMRayActor pid=42928) WARNING 05-26 19:39:55 [model_runner.py:1296] Computed max_num_seqs (min(256, 8192 // 32768)) to be less than 1. Setting it to the minimum value of 1. [repeated 7x across cluster]
- (LLMRayActor pid=42924) INFO 05-26 19:40:02 [worker.py:267] Memory profiling takes 7.96 seconds
- (LLMRayActor pid=42924) INFO 05-26 19:40:02 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.50) = 39.66GiB
- (LLMRayActor pid=42924) INFO 05-26 19:40:02 [worker.py:267] model weights take 15.63GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 1.09GiB; the rest of the memory reserved for KV Cache is 22.73GiB.
- (LLMRayActor pid=42924) INFO 05-26 19:40:02 [executor_base.py:111] # cuda blocks: 26598, # CPU blocks: 4681
- (LLMRayActor pid=42924) INFO 05-26 19:40:02 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 51.95x
logs/20250526_194456/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 11287
- Train PID: 11288

logs/20250526_194456/remote_rm_qa.log
DELETED
File without changes

logs/20250526_194456/train.log
DELETED
@@ -1,92 +0,0 @@
- Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
-     sock = connection.create_connection(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
-     raise err
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
-     sock.connect(sa)
- ConnectionRefusedError: [Errno 111] Connection refused
-
- The above exception was the direct cause of the following exception:
-
- Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
-     response = self._make_request(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connectionpool.py", line 493, in _make_request
-     conn.request(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connection.py", line 445, in request
-     self.endheaders()
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/http/client.py", line 1278, in endheaders
-     self._send_output(message_body, encode_chunked=encode_chunked)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/http/client.py", line 1038, in _send_output
-     self.send(msg)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/http/client.py", line 976, in send
-     self.connect()
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connection.py", line 276, in connect
-     self.sock = self._new_conn()
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
-     raise NewConnectionError(
- urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fabc4ee1cf0>: Failed to establish a new connection: [Errno 111] Connection refused
-
- The above exception was the direct cause of the following exception:
-
- Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
-     resp = conn.urlopen(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/connectionpool.py", line 841, in urlopen
-     retries = retries.increment(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
-     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
- urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=2983): Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fabc4ee1cf0>: Failed to establish a new connection: [Errno 111] Connection refused'))
-
- During handling of the above exception, another exception occurred:
-
- Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 262, in _check_connection_and_version_with_url
-     r = self._do_request("GET", url)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 303, in _do_request
-     return requests.request(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/requests/api.py", line 59, in request
-     return session.request(method=method, url=url, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
-     resp = self.send(prep, **send_kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
-     r = adapter.send(request, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
-     raise ConnectionError(e, request=request)
- requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=2983): Max retries exceeded with url: /api/version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fabc4ee1cf0>: Failed to establish a new connection: [Errno 111] Connection refused'))
-
- During handling of the above exception, another exception occurred:
-
- Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/bin/ray", line 8, in <module>
-     sys.exit(main())
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2690, in main
-     return cli()
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
-     return self.main(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1082, in main
-     rv = self.invoke(ctx)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
-     return _process_result(sub_ctx.command.invoke(sub_ctx))
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
-     return _process_result(sub_ctx.command.invoke(sub_ctx))
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
-     return ctx.invoke(self.callback, **ctx.params)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 788, in invoke
-     return __callback(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
-     return func(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
-     return f(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 267, in submit
-     client = _get_sdk_client(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 32, in _get_sdk_client
-     client = JobSubmissionClient(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/sdk.py", line 105, in __init__
-     self._check_connection_and_version(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
-     self._check_connection_and_version_with_url(min_version, version_error_message)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 278, in _check_connection_and_version_with_url
-     raise ConnectionError(
- ConnectionError: Failed to connect to Ray at address: http://127.0.0.1:2983.
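Editorial note: the train.log above shows `ray job submit` dying because nothing was listening on the Job Submission server address http://127.0.0.1:2983; the Ray head node (and its dashboard) was evidently not up when the launcher ran. A minimal sketch, not part of the original scripts, of probing the same /api/version endpoint the Job SDK checks before submitting:

import time

import requests

DASHBOARD = "http://127.0.0.1:2983"  # address from the failing run above


def wait_for_dashboard(url: str = DASHBOARD, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll the Ray dashboard's /api/version endpoint until it answers."""
    for _ in range(attempts):
        try:
            if requests.get(f"{url}/api/version", timeout=2).ok:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(delay)
    return False


if not wait_for_dashboard():
    raise SystemExit(f"Ray dashboard not reachable at {DASHBOARD}; start the head node first.")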
logs/20250527_011343/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 4686
- Train PID: 4687

logs/20250527_011343/remote_rm_qa.log
DELETED
@@ -1,9 +0,0 @@
- [2025-05-27 01:14:20,482] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- * Serving Flask app 'math_verifier_wolatex'
- * Debug mode: off
- WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2394
- * Running on http://10.140.0.151:2394
- Press CTRL+C to quit

logs/20250527_011343/train.log
DELETED
@@ -1,45 +0,0 @@
- 2025-05-27 01:14:05,551 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_ebceb3f924b2a11c.zip.
- 2025-05-27 01:14:05,551 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
- 2025-05-27 01:14:04,421 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:2983
- Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/bin/ray", line 8, in <module>
-     sys.exit(main())
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2690, in main
-     return cli()
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
-     return self.main(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1082, in main
-     rv = self.invoke(ctx)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
-     return _process_result(sub_ctx.command.invoke(sub_ctx))
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
-     return _process_result(sub_ctx.command.invoke(sub_ctx))
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
-     return ctx.invoke(self.callback, **ctx.params)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/click/core.py", line 788, in invoke
-     return __callback(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
-     return func(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
-     return f(*args, **kwargs)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 276, in submit
-     job_id = client.submit_job(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/sdk.py", line 250, in submit_job
-     self._raise_error(r)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
-     raise RuntimeError(
- RuntimeError: Request failed with status code 500: Traceback (most recent call last):
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 388, in submit_job
-     resp = await job_agent_client.submit_job_internal(submit_request)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 82, in submit_job_internal
-     async with self._session.post(
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__
-     self._resp: _RetType = await self._coro
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/aiohttp/client.py", line 730, in _request
-     await resp.start(conn)
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1059, in start
-     message, payload = await protocol.read()  # type: ignore[union-attr]
-   File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/aiohttp/streams.py", line 672, in read
-     await self._waiter
- aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
- .
logs/20250527_095510/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 299008
- Train PID: 299009

logs/20250527_095510/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8687dd357ffe27589cfa1830fb7e278b2da07b36db566391235527667230c5da
- size 68981940

logs/20250527_095510/train.log
DELETED
The diff for this file is too large to render.
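Editorial note: the three deleted lines in remote_rm_qa.log above are not the log itself but a Git LFS pointer; the real ~69 MB file lives in LFS object storage, addressed by the oid and size fields. A minimal sketch of reading such a pointer (the format is plain 'key value' lines per the LFS spec):

def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:8687dd357ffe27589cfa1830fb7e278b2da07b36db566391235527667230c5da
size 68981940"""
assert parse_lfs_pointer(pointer)["size"] == "68981940"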
logs/20250527_235509/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 412160
- Train PID: 412161

logs/20250527_235509/remote_rm_qa.log
DELETED
The diff for this file is too large to render.

logs/20250527_235509/train.log
DELETED
The diff for this file is too large to render.
logs/20250528_110535/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 268533
- Train PID: 268534

logs/20250528_110535/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:34ca2314ebf41df0ef2e41309a606861f152d7d4f8c45c559336e02c69cdaea1
- size 22480261

logs/20250528_110535/train.log
DELETED
The diff for this file is too large to render.
logs/20250528_161139/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 395733
- Train PID: 395734

logs/20250528_161139/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1c3ce9ee9d15d3a606842f7b7fcadd18f5efe8426c15279342e86e14f021a801
- size 59636609

logs/20250528_161139/train.log
DELETED
The diff for this file is too large to render.
logs/20250529_214257/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 332906
- Train PID: 332907

logs/20250529_214257/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1a81eeda8d4083e2da026795759cdbf80c7be314480b3705f06293d4cd26cbb4
- size 31232415

logs/20250529_214257/train.log
DELETED
The diff for this file is too large to render.
logs/20250611_110725/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 128943
- Train PID: 128944

logs/20250611_110725/remote_rm_qa.log
DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:b03d175ee35078ea79f97a809654b6eb1c8f9b7944755da98a05420253009b30
- size 20260314

logs/20250611_110725/train.log
DELETED
The diff for this file is too large to render.
logs/20250611_150946/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 328854
- Train PID: 328855

logs/20250611_150946/remote_rm_qa.log
DELETED
The diff for this file is too large to render.

logs/20250611_150946/train.log
DELETED
The diff for this file is too large to render.
logs/20250611_160325/process_pids.txt
DELETED
@@ -1,2 +0,0 @@
- Remote RM PID: 459479
- Train PID: 459480

logs/20250611_160325/remote_rm_qa.log
DELETED
@@ -1,9 +0,0 @@
- [2025-06-11 16:04:02,932] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- load dataset success
- * Serving Flask app 'math_verifier_wolatex'
- * Debug mode: off
- WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
- * Running on all addresses (0.0.0.0)
- * Running on http://127.0.0.1:2399
- * Running on http://10.140.1.42:2399
- Press CTRL+C to quit

logs/20250611_160325/train.log
DELETED
The diff for this file is too large to render.
logs/20250611_161239/process_pids.txt
DELETED
|
@@ -1,2 +0,0 @@
|
|
| 1 |
-
Remote RM PID: 104691
|
| 2 |
-
Train PID: 104692
|
|
|
|
|
|
|
|
|
logs/20250611_161239/remote_rm_qa.log
DELETED
|
@@ -1,9 +0,0 @@
|
|
| 1 |
-
[2025-06-11 16:13:16,143] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
|
| 2 |
-
load dataset success
|
| 3 |
-
* Serving Flask app 'math_verifier_wolatex'
|
| 4 |
-
* Debug mode: off
|
| 5 |
-
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
|
| 6 |
-
* Running on all addresses (0.0.0.0)
|
| 7 |
-
* Running on http://127.0.0.1:2399
|
| 8 |
-
* Running on http://10.140.1.48:2399
|
| 9 |
-
Press CTRL+C to quit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
logs/20250611_161239/train.log
DELETED
|
@@ -1,188 +0,0 @@
|
|
| 1 |
-
2025-06-11 16:13:02,348 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_e427a4376bfc803e.zip.
|
| 2 |
-
2025-06-11 16:13:02,349 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
|
| 3 |
-
2025-06-11 16:13:00,858 INFO cli.py:39 -- [37mJob submission server address[39m: [1mhttp://127.0.0.1:2989[22m
|
| 4 |
-
2025-06-11 16:13:08,059 SUCC cli.py:63 -- [32m-------------------------------------------------------[39m
|
| 5 |
-
2025-06-11 16:13:08,059 SUCC cli.py:64 -- [32mJob 'raysubmit_aazeFP8fmRtZyntC' submitted successfully[39m
|
| 6 |
-
2025-06-11 16:13:08,059 SUCC cli.py:65 -- [32m-------------------------------------------------------[39m
|
| 7 |
-
2025-06-11 16:13:08,059 INFO cli.py:289 -- [36mNext steps[39m
|
| 8 |
-
2025-06-11 16:13:08,060 INFO cli.py:290 -- Query the logs of the job:
|
| 9 |
-
2025-06-11 16:13:08,060 INFO cli.py:292 -- [1mray job logs raysubmit_aazeFP8fmRtZyntC[22m
|
| 10 |
-
2025-06-11 16:13:08,060 INFO cli.py:294 -- Query the status of the job:
|
| 11 |
-
2025-06-11 16:13:08,060 INFO cli.py:296 -- [1mray job status raysubmit_aazeFP8fmRtZyntC[22m
|
| 12 |
-
2025-06-11 16:13:08,060 INFO cli.py:298 -- Request the job to be stopped:
|
| 13 |
-
2025-06-11 16:13:08,060 INFO cli.py:300 -- [1mray job stop raysubmit_aazeFP8fmRtZyntC[22m
|
| 14 |
-
2025-06-11 16:13:08,062 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
|
| 15 |
-
2025-06-11 16:13:07,524 INFO job_manager.py:531 -- Runtime env is setting up.
|
| 16 |
-
[2025-06-11 16:13:32,045] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
|
| 17 |
-
INFO 06-11 16:13:37 [__init__.py:239] Automatically detected platform cuda.
|
| 18 |
-
2025-06-11 16:13:38,183 INFO worker.py:1520 -- Using address 10.140.1.48:6239 set in the environment variable RAY_ADDRESS
|
| 19 |
-
2025-06-11 16:13:38,184 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.1.48:6239...
|
| 20 |
-
2025-06-11 16:13:38,204 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at [1m[32m10.140.1.48:2989 [39m[22m
|
| 21 |
-
[36m(pid=117690)[0m INFO 06-11 16:14:04 [__init__.py:239] Automatically detected platform cuda.
|
| 22 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
|
| 23 |
-
[36m(LLMRayActor pid=117685)[0m WARNING 06-11 16:14:34 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
|
| 24 |
-
[36m(LLMRayActor pid=117685)[0m WARNING 06-11 16:14:34 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
|
| 25 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:14:34 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
|
| 26 |
-
[36m(pid=117689)[0m INFO 06-11 16:14:04 [__init__.py:239] Automatically detected platform cuda.[32m [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)[0m
|
| 27 |
-
[36m(LLMRayActor pid=117691)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
|
| 28 |
-
[36m(LLMRayActor pid=117686)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
|
| 29 |
-
[36m(LLMRayActor pid=117689)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'reward', 'score', 'generate', 'classify', 'embed'}. Defaulting to 'generate'.
|
| 30 |
-
[36m(LLMRayActor pid=117690)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
|
| 31 |
-
[36m(LLMRayActor pid=117688)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'generate', 'reward', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
|
| 32 |
-
[36m(LLMRayActor pid=117687)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
|
| 33 |
-
[36m(LLMRayActor pid=117692)[0m INFO 06-11 16:14:34 [config.py:585] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
|
| 34 |
-
[36m(LLMRayActor pid=117685)[0m [2025-06-11 16:14:37,962] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
|
| 35 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:14:43 [cuda.py:293] Using Flash Attention backend.
|
| 36 |
-
[36m(LLMRayActor pid=117692)[0m WARNING 06-11 16:14:34 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.[32m [repeated 7x across cluster][0m
|
| 37 |
-
[36m(LLMRayActor pid=117692)[0m WARNING 06-11 16:14:34 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.[32m [repeated 7x across cluster][0m
|
| 38 |
-
[36m(LLMRayActor pid=117692)[0m INFO 06-11 16:14:34 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=45, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, [32m [repeated 7x across cluster][0m
|
| 39 |
-
[36m(LLMRayActor pid=117692)[0m [2025-06-11 16:14:37,973] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)[32m [repeated 7x across cluster][0m
|
| 40 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:14:47 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
|
| 41 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:14:47 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/...
|
| 42 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:14:48 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
|
| 43 |
-
[36m(LLMRayActor pid=117692)[0m
|
| 44 |
-
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
|
| 45 |
-
[36m(LLMRayActor pid=117692)[0m INFO 06-11 16:14:43 [cuda.py:293] Using Flash Attention backend.[32m [repeated 7x across cluster][0m
|
| 46 |
-
[36m(LLMRayActor pid=117685)[0m
|
| 47 |
-
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.24it/s]
|
| 48 |
-
[36m(LLMRayActor pid=117689)[0m
|
| 49 |
-
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s][32m [repeated 7x across cluster][0m
|
| 50 |
-
[36m(LLMRayActor pid=117685)[0m
|
| 51 |
-
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:08<00:06, 3.04s/it][32m [repeated 16x across cluster][0m
|
| 52 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:15:04 [loader.py:429] Loading weights took 15.74 seconds
|
| 53 |
-
[36m(LLMRayActor pid=117689)[0m INFO 06-11 16:14:47 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0[32m [repeated 7x across cluster][0m
|
| 54 |
-
[36m(LLMRayActor pid=117689)[0m INFO 06-11 16:14:47 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/...[32m [repeated 7x across cluster][0m
|
| 55 |
-
[36m(LLMRayActor pid=117690)[0m INFO 06-11 16:14:48 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248][32m [repeated 7x across cluster][0m
|
| 56 |
-
[36m(LLMRayActor pid=117685)[0m
|
| 57 |
-
[36m(LLMRayActor pid=117685)[0m
|
| 58 |
-
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:15<00:00, 3.12s/it][32m [repeated 17x across cluster][0m
|
| 59 |
-
[36m(LLMRayActor pid=117690)[0m
|
| 60 |
-
[36m(LLMRayActor pid=117687)[0m
|
| 61 |
-
[36m(LLMRayActor pid=117691)[0m
|
| 62 |
-
[36m(LLMRayActor pid=117689)[0m
|
| 63 |
-
[36m(LLMRayActor pid=117692)[0m
|
| 64 |
-
[36m(LLMRayActor pid=117686)[0m
|
| 65 |
-
[36m(LLMRayActor pid=117688)[0m
|
| 66 |
-
[36m(LLMRayActor pid=117692)[0m INFO 06-11 16:15:04 [model_runner.py:1146] Model loading took 15.6271 GB and 16.982959 seconds
|
| 67 |
-
[36m(LLMRayActor pid=117692)[0m Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
|
| 68 |
-
[36m(LLMRayActor pid=117692)[0m WARNING 06-11 16:15:05 [model_runner.py:1296] Computed max_num_seqs (min(256, 8192 // 32768)) to be less than 1. Setting it to the minimum value of 1.
|
| 69 |
-
[36m(LLMRayActor pid=117685)[0m WARNING 06-11 16:15:10 [profiling.py:222] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 8192) is too short to hold the multi-modal embeddings in the worst case (32768 tokens in total, out of which {'image': 16384, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
|
| 70 |
-
[36m(LLMRayActor pid=117692)[0m INFO 06-11 16:15:04 [loader.py:429] Loading weights took 15.90 seconds[32m [repeated 7x across cluster][0m
|
| 71 |
-
[36m(LLMRayActor pid=117687)[0m INFO 06-11 16:15:04 [model_runner.py:1146] Model loading took 15.6271 GB and 17.057258 seconds[32m [repeated 7x across cluster][0m
|
| 72 |
-
[36m(LLMRayActor pid=117687)[0m WARNING 06-11 16:15:05 [model_runner.py:1296] Computed max_num_seqs (min(256, 8192 // 32768)) to be less than 1. Setting it to the minimum value of 1.[32m [repeated 7x across cluster][0m
|
| 73 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:15:12 [worker.py:267] Memory profiling takes 8.24 seconds
|
| 74 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:15:12 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.50) = 39.66GiB
|
| 75 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:15:12 [worker.py:267] model weights take 15.63GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 1.09GiB; the rest of the memory reserved for KV Cache is 22.73GiB.
|
| 76 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:15:13 [executor_base.py:111] # cuda blocks: 26598, # CPU blocks: 4681
|
| 77 |
-
[36m(LLMRayActor pid=117685)[0m INFO 06-11 16:15:13 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 51.95x
|
| 78 |
-
[36m(LLMRayActor pid=117688)[0m CUDA Error: out of memory at /mnt/petrelfs/luyiting/MultiAgentEval/vllm/csrc/cumem_allocator.cpp:62
|
| 79 |
-
[36m(LLMRayActor pid=117688)[0m
|
| 80 |
-
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:15<00:00, 3.12s/it][32m [repeated 14x across cluster][0m
|
| 81 |
-
[36m(LLMRayActor pid=117688)[0m Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.[32m [repeated 7x across cluster][0m
|
| 82 |
-
-Traceback (most recent call last):
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
-    return _run_code(code, main_globals, None,
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/runpy.py", line 86, in _run_code
-    exec(code, run_globals)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-06-11_16-12-48_168623_103703/runtime_resources/working_dir_files/_ray_pkg_e427a4376bfc803e/openrlhf/cli/train_ppo_ray.py", line 497, in <module>
-    train(args)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-06-11_16-12-48_168623_103703/runtime_resources/working_dir_files/_ray_pkg_e427a4376bfc803e/openrlhf/cli/train_ppo_ray.py", line 86, in train
-    vllm_engines = create_vllm_engines(
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-06-11_16-12-48_168623_103703/runtime_resources/working_dir_files/_ray_pkg_e427a4376bfc803e/openrlhf/trainer/ray/vllm_engine.py", line 189, in create_vllm_engines
-    batch_vllm_engine_call(vllm_engines, "sleep", rank_0_only=False)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-06-11_16-12-48_168623_103703/runtime_resources/working_dir_files/_ray_pkg_e427a4376bfc803e/openrlhf/trainer/ray/vllm_engine.py", line 216, in batch_vllm_engine_call
-    return ray.get(refs)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
-    return fn(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
-    return func(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
-    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
-  File "/mnt/petrelfs/luyiting/anaconda3/envs/lmmr1/lib/python3.10/site-packages/ray/_private/worker.py", line 931, in get_objects
-    raise value
-ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=117688, ip=10.140.1.48, actor_id=ac9437056a85b29812c677da02000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f59ee72e170>)
-  File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-06-11_16-12-48_168623_103703/runtime_resources/working_dir_files/_ray_pkg_e427a4376bfc803e/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
-    self.llm = LLM(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
-    return fn(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
-    self.llm_engine = LLMEngine.from_engine_args(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
-    return engine_cls.from_vllm_config(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
-    return cls(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 283, in __init__
-    self._initialize_kv_caches()
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 445, in _initialize_kv_caches
-    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 122, in initialize_cache
-    self.collective_rpc("initialize_cache",
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
-    answer = run_method(self.driver_worker, method, args, kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
-    return func(*args, **kwargs)
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 307, in initialize_cache
-    self._init_cache_engine()
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 312, in _init_cache_engine
-    self.cache_engine = [
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 313, in <listcomp>
-    CacheEngine(self.cache_config, self.model_config,
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/cache_engine.py", line 64, in __init__
-    self.gpu_cache = self._allocate_kv_cache(
-  File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/cache_engine.py", line 83, in _allocate_kv_cache
-    layer_kv_cache = torch.zeros(kv_cache_shape,
-torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 832.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 611.56 MiB is free. Process 71866 has 48.06 GiB memory in use. Including non-PyTorch memory, this process has 30.66 GiB memory in use. Of the allocated memory 29.60 GiB is allocated by PyTorch, with 159.90 MiB allocated in private pools (e.g., CUDA Graphs), and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
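The failure itself is plain arithmetic on GPU 0 rather than fragmentation: another process already holds most of the card, so the 832 MiB KV-cache tensor cannot fit. A quick check against the numbers in the message (values copied from the log; the small gap versus the reported 611.56 MiB free comes from rounding in the GiB figures):

```python
GiB, MiB = 1024 ** 3, 1024 ** 2

total = 79.32 * GiB     # GPU 0 capacity
other = 48.06 * GiB     # process 71866, sharing the card
this = 30.66 * GiB      # the dying LLMRayActor, incl. non-PyTorch memory
requested = 832 * MiB   # the KV-cache allocation that triggered the OOM

free = total - other - this
print(f"free ~= {free / MiB:.0f} MiB, requested = {requested / MiB:.0f} MiB")
# free ~= 614 MiB, requested = 832 MiB -> torch.OutOfMemoryError
```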
-(LLMRayActor pid=117688) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::LLMRayActor.__init__() (pid=117688, ip=10.140.1.48, actor_id=ac9437056a85b29812c677da02000000, repr=<openrlhf.trainer.ray.vllm_engine.LLMRayActor object at 0x7f59ee72e170>)
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/tmp_ray/session_2025-06-11_16-12-48_168623_103703/runtime_resources/working_dir_files/_ray_pkg_e427a4376bfc803e/openrlhf/trainer/ray/vllm_engine.py", line 54, in __init__
-(LLMRayActor pid=117688)     self.llm = LLM(*args, **kwargs)
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 1037, in inner
-(LLMRayActor pid=117688)     return fn(*args, **kwargs)
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/entrypoints/llm.py", line 243, in __init__
-(LLMRayActor pid=117688)     self.llm_engine = LLMEngine.from_engine_args(
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 520, in from_engine_args
-(LLMRayActor pid=117688)     return engine_cls.from_vllm_config(
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 496, in from_vllm_config
-(LLMRayActor pid=117688)     return cls(
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 283, in __init__
-(LLMRayActor pid=117688)     self._initialize_kv_caches()
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/engine/llm_engine.py", line 445, in _initialize_kv_caches
-(LLMRayActor pid=117688)     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/executor_base.py", line 122, in initialize_cache
-(LLMRayActor pid=117688)     self.collective_rpc("initialize_cache",
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
-(LLMRayActor pid=117688)     answer = run_method(self.driver_worker, method, args, kwargs)
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/utils.py", line 2255, in run_method
-(LLMRayActor pid=117688)     return func(*args, **kwargs)
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 307, in initialize_cache
-(LLMRayActor pid=117688)     self._init_cache_engine()
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 312, in _init_cache_engine
-(LLMRayActor pid=117688)     self.cache_engine = [
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/worker.py", line 313, in <listcomp>
-(LLMRayActor pid=117688)     CacheEngine(self.cache_config, self.model_config,
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/cache_engine.py", line 64, in __init__
-(LLMRayActor pid=117688)     self.gpu_cache = self._allocate_kv_cache(
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/cache_engine.py", line 83, in _allocate_kv_cache
-(LLMRayActor pid=117688)     layer_kv_cache = torch.zeros(kv_cache_shape,
-(LLMRayActor pid=117688) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 832.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 611.56 MiB is free. Process 71866 has 48.06 GiB memory in use. Including non-PyTorch memory, this process has 30.66 GiB memory in use. Of the allocated memory 29.60 GiB is allocated by PyTorch, with 159.90 MiB allocated in private pools (e.g., CUDA Graphs), and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
-(LLMRayActor pid=117687) WARNING 06-11 16:15:11 [profiling.py:222] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 8192) is too short to hold the multi-modal embeddings in the worst case (32768 tokens in total, out of which {'image': 16384, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`. [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [worker.py:267] Memory profiling takes 8.37 seconds [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.50) = 39.66GiB [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [worker.py:267] model weights take 15.63GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 1.09GiB; the rest of the memory reserved for KV Cache is 22.73GiB. [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [executor_base.py:111] # cuda blocks: 26598, # CPU blocks: 4681 [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 51.95x [repeated 7x across cluster]
-(LLMRayActor pid=117689) CUDA Error: out of memory at /mnt/petrelfs/luyiting/MultiAgentEval/vllm/csrc/cumem_allocator.cpp:62 [repeated 2x across cluster]
-2025-06-11 16:15:19,325 ERR cli.py:71 -- ---------------------------------------
-2025-06-11 16:15:19,326 ERR cli.py:72 -- Job 'raysubmit_aazeFP8fmRtZyntC' failed
-2025-06-11 16:15:19,326 ERR cli.py:73 -- ---------------------------------------
-2025-06-11 16:15:19,326 INFO cli.py:86 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
-(LLMRayActor pid=117688)   File "/mnt/petrelfs/luyiting/MultiAgentEval/vllm/vllm/worker/cache_engine.py", line 83, in _allocate_kv_cache
-(LLMRayActor pid=117688)     layer_kv_cache = torch.zeros(kv_cache_shape,
-(LLMRayActor pid=117688) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 832.00 MiB. GPU 0 has a total capacity of 79.32 GiB of which 611.56 MiB is free. Process 71866 has 48.06 GiB memory in use. Including non-PyTorch memory, this process has 30.66 GiB memory in use. Of the allocated memory 29.60 GiB is allocated by PyTorch, with 159.90 MiB allocated in private pools (e.g., CUDA Graphs), and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
-(LLMRayActor pid=117687) WARNING 06-11 16:15:11 [profiling.py:222] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 8192) is too short to hold the multi-modal embeddings in the worst case (32768 tokens in total, out of which {'image': 16384, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`. [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [worker.py:267] Memory profiling takes 8.37 seconds [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.50) = 39.66GiB [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [worker.py:267] model weights take 15.63GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 1.09GiB; the rest of the memory reserved for KV Cache is 22.73GiB. [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [executor_base.py:111] # cuda blocks: 26598, # CPU blocks: 4681 [repeated 7x across cluster]
-(LLMRayActor pid=117687) INFO 06-11 16:15:13 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 51.95x [repeated 7x across cluster]
-(LLMRayActor pid=117689) CUDA Error: out of memory at /mnt/petrelfs/luyiting/MultiAgentEval/vllm/csrc/cumem_allocator.cpp:62 [repeated 2x across cluster]
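Note that the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` hint in the error text only mitigates fragmentation, while the logs show GPU 0 simply being shared with a 48 GiB process; the knob the logs themselves surface is `gpu_memory_utilization` (0.50 above). A hypothetical adjustment, shown as a sketch rather than the settings this run actually used (the model path is a placeholder, and both keyword arguments are standard vLLM `LLM` constructor parameters):

```python
import os

# Must be set before CUDA is initialized to have any effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM  # import only after the env var is set

llm = LLM(
    model="path/to/model",        # placeholder; the logs do not name the checkpoint
    gpu_memory_utilization=0.35,  # down from the 0.50 profiled above, leaving headroom
    max_model_len=8192,           # matches the 8192-token requests in the logs
)
```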
logs/20250611_162203/process_pids.txt DELETED
@@ -1,2 +0,0 @@
-Remote RM PID: 74069
-Train PID: 74070
logs/20250611_162203/remote_rm_qa.log DELETED
The diff for this file is too large to render. See raw diff