nohup: ignoring input
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Running on 8 GPU(s)
Using attn_implementation: target=flash_attention_2, draft=sdpa
Loading target model: /workspace/models/Qwen3-8B
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 0%| | 0/5 [00:00
> >) + 0x1d2 (0x7efffd3738d2 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7efffd37a313 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efffd37c6fc in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdf0e6 (0x7f004e5e40e6 in /workspace/miniconda3/envs/dflash/bin/../lib/libstdc++.so.6)
frame #5: + 0x891f5 (0x7f0050ee61f5 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x1098dc (0x7f0050f668dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=600000) ran for 600074 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f004b0cbf86 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x1d2 (0x7efffd3738d2 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7efffd37a313 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efffd37c6fc in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdf0e6 (0x7f004e5e40e6 in /workspace/miniconda3/envs/dflash/bin/../lib/libstdc++.so.6)
frame #5: + 0x891f5 (0x7f0050ee61f5 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x1098dc (0x7f0050f668dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f004b0cbf86 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7efffd005a84 in /workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdf0e6 (0x7f004e5e40e6 in /workspace/miniconda3/envs/dflash/bin/../lib/libstdc++.so.6)
frame #3: + 0x891f5 (0x7f0050ee61f5 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x1098dc (0x7f0050f668dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
W0319 15:53:49.559000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4318 closing signal SIGTERM
W0319 15:53:49.560000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4319 closing signal SIGTERM
W0319 15:53:49.560000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4320 closing signal SIGTERM
W0319 15:53:49.561000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4321 closing signal SIGTERM
W0319 15:53:49.562000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4322 closing signal SIGTERM
W0319 15:53:49.562000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4323 closing signal SIGTERM
W0319 15:53:49.563000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4324 closing signal SIGTERM
E0319 15:53:52.882000 140004854380352 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 7 (pid: 4325) of binary: /workspace/miniconda3/envs/dflash/bin/python3
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 905, in
    main()
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/dflash/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/hanrui/syxin_old/eval_dflash_lora_inject.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2026-03-19_15:53:49
  host : job-006ce80a7c47-20260302193512-5dcd4c9bbd-lkk5h
  rank : 7 (local_rank: 7)
  exitcode : -6 (pid: 4325)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 4325
============================================================