nohup: ignoring input
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/syxin_old/Specforge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Loading checkpoint shards: 0%| | 0/5 [00:00
[rank0]:     main()
[rank0]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank0]:     for step_in_epoch, data in enumerate(progress_bar):
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]:     for obj in iterable:
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank0]:     return self._get_iterator()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank0]:     return _MultiProcessingDataLoaderIter(self)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank0]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank0]:     return Queue(maxsize, ctx=self.get_context())
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 49, in __init__
[rank0]:     self._sem = ctx.BoundedSemaphore(maxsize)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 88, in BoundedSemaphore
[rank0]:     return BoundedSemaphore(value, ctx=self.get_context())
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 152, in __init__
[rank0]:     SemLock.__init__(self, SEMAPHORE, value, value, ctx=ctx)
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank0]:     sl = self._semlock = _multiprocessing.SemLock(
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: OSError: [Errno 28] No space left on device
[rank5]: Traceback (most recent call last):
[rank5]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank5]:     main()
[rank5]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank5]:     for step_in_epoch, data in enumerate(progress_bar):
[rank5]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank5]:     return self._get_iterator()
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank5]:     return _MultiProcessingDataLoaderIter(self)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank5]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank5]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank5]:     return Queue(maxsize, ctx=self.get_context())
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank5]:     self._rlock = ctx.Lock()
[rank5]:                   ^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank5]:     return Lock(ctx=self.get_context())
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank5]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank5]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank5]:     sl = self._semlock = _multiprocessing.SemLock(
[rank5]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: OSError: [Errno 28] No space left on device
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank1]:     for step_in_epoch, data in enumerate(progress_bar):
[rank1]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank1]:     return self._get_iterator()
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank1]:     return _MultiProcessingDataLoaderIter(self)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank1]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank1]:     return Queue(maxsize, ctx=self.get_context())
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank1]:     self._rlock = ctx.Lock()
[rank1]:                   ^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank1]:     return Lock(ctx=self.get_context())
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank1]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank1]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank1]:     sl = self._semlock = _multiprocessing.SemLock(
[rank1]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: OSError: [Errno 28] No space left on device
[rank3]: Traceback (most recent call last):
[rank3]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank3]:     main()
[rank3]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank3]:     for step_in_epoch, data in enumerate(progress_bar):
[rank3]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank3]:     return self._get_iterator()
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank3]:     return _MultiProcessingDataLoaderIter(self)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank3]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank3]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank3]:     return Queue(maxsize, ctx=self.get_context())
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank3]:     self._rlock = ctx.Lock()
[rank3]:                   ^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank3]:     return Lock(ctx=self.get_context())
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank3]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank3]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank3]:     sl = self._semlock = _multiprocessing.SemLock(
[rank3]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: OSError: [Errno 28] No space left on device
[rank2]: Traceback (most recent call last):
[rank2]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank2]:     main()
[rank2]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank2]:     for step_in_epoch, data in enumerate(progress_bar):
[rank2]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank2]:     return self._get_iterator()
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank2]:     return _MultiProcessingDataLoaderIter(self)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank2]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank2]:     return Queue(maxsize, ctx=self.get_context())
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank2]:     self._rlock = ctx.Lock()
[rank2]:                   ^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank2]:     return Lock(ctx=self.get_context())
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank2]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank2]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank2]:     sl = self._semlock = _multiprocessing.SemLock(
[rank2]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: OSError: [Errno 28] No space left on device
[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank4]:     main()
[rank4]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank4]:     for step_in_epoch, data in enumerate(progress_bar):
[rank4]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank4]:     return self._get_iterator()
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank4]:     return _MultiProcessingDataLoaderIter(self)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank4]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank4]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank4]:     return Queue(maxsize, ctx=self.get_context())
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank4]:     self._rlock = ctx.Lock()
[rank4]:                   ^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank4]:     return Lock(ctx=self.get_context())
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank4]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank4]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank4]:     sl = self._semlock = _multiprocessing.SemLock(
[rank4]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: OSError: [Errno 28] No space left on device
[rank7]: Traceback (most recent call last):
[rank7]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank7]:     main()
[rank7]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank7]:     for step_in_epoch, data in enumerate(progress_bar):
[rank7]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank7]:     return self._get_iterator()
[rank7]:            ^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank7]:     return _MultiProcessingDataLoaderIter(self)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank7]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank7]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank7]:     return Queue(maxsize, ctx=self.get_context())
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank7]:     self._rlock = ctx.Lock()
[rank7]:                   ^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank7]:     return Lock(ctx=self.get_context())
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank7]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank7]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank7]:     sl = self._semlock = _multiprocessing.SemLock(
[rank7]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: OSError: [Errno 28] No space left on device
[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 487, in <module>
[rank6]:     main()
[rank6]:   File "/workspace/hanrui/syxin_old/Specforge/scripts/train_dflash_lora_inject.py", line 438, in main
[rank6]:     for step_in_epoch, data in enumerate(progress_bar):
[rank6]:                                ^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 494, in __iter__
[rank6]:     return self._get_iterator()
[rank6]:            ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 427, in _get_iterator
[rank6]:     return _MultiProcessingDataLoaderIter(self)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1140, in __init__
[rank6]:     index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
[rank6]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 103, in Queue
[rank6]:     return Queue(maxsize, ctx=self.get_context())
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
[rank6]:     self._rlock = ctx.Lock()
[rank6]:                   ^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/context.py", line 68, in Lock
[rank6]:     return Lock(ctx=self.get_context())
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 169, in __init__
[rank6]:     SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
[rank6]:     sl = self._semlock = _multiprocessing.SemLock(
[rank6]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 28] No space left on device
[rank0]:[W309 17:04:38.228986039 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank5]:[W309 17:04:38.547654961 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank2]:[W309 17:04:38.725243170 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank6]:[W309 17:04:39.775916194 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]:[W309 17:04:39.906748747 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W309 17:04:39.092233950 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank4]:[W309 17:04:39.101225015 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank7]:[W309 17:04:39.138781864 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0309 17:04:39.582000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9662 closing signal SIGTERM
W0309 17:04:39.582000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9663 closing signal SIGTERM
W0309 17:04:39.583000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9664 closing signal SIGTERM
W0309 17:04:39.583000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9665 closing signal SIGTERM
W0309 17:04:39.583000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9666 closing signal SIGTERM
W0309 17:04:39.583000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9668 closing signal SIGTERM
W0309 17:04:39.584000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 9669 closing signal SIGTERM
E0309 17:04:41.364000 9566 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 5 (pid: 9667) of binary: /workspace/miniconda3/envs/specforge/bin/python3
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 940, in <module>
    main()
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train_dflash_lora_inject.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-09_17:04:39
  host      : job-006ce80a7c47-20260302193512-674f5ccb6f-hjq4c
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 9667)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================