nohup: ignoring input
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803]
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
INFO:specforge.utils:rank 0: bind to device 0
INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 0: Initialized distributed
INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 3: bind to device 3
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 4: bind to device 4
INFO:specforge.utils:rank 2: bind to device 2
INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 3: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 4: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 2: Initialized distributed
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 5: bind to device 5
INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 5: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 7: bind to device 7
INFO:specforge.utils:rank 1: bind to device 1
INFO:specforge.utils:rank 6: bind to device 6
INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 7: Initialized distributed
INFO:specforge.utils:rank 1: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 6: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Loading checkpoint shards: 0%| | 0/5 [00:00
[rank0]:     main()
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
[rank0]:     train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
[rank0]:     train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
[rank0]:     builder_instance.download_and_prepare(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
[rank0]:     self._download_and_prepare(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
[rank0]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
[rank0]:     for job_id, done, content in self._prepare_split_single(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
[rank0]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1837 examples [00:00, 11719.94 examples/s]
Generating train split: 3552 examples [00:00, 11699.05 examples/s]
Generating train split: 5305 examples [00:00, 12818.69 examples/s]
Generating train split: 7092 examples [00:00, 13468.71 examples/s]
[rank0]:[W405 10:15:29.947786516 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Generating train split: 8810 examples [00:00, 13834.01 examples/s]
Generating train split: 10577 examples [00:00, 14054.90 examples/s]
Generating train split: 12339 examples [00:00, 12805.47 examples/s]
Generating train split: 13682 examples [00:01, 9900.64 examples/s]
Generating train split: 14998 examples [00:01, 10026.49 examples/s]
Generating train split: 16762 examples [00:01, 8287.30 examples/s]
Generating train split: 17694 examples [00:01, 7877.81 examples/s]
Generating train split: 17694 examples [00:01, 10199.12 examples/s]
[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
[rank6]:     writer.write_table(table)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
[rank6]:     self.pa_writer.write_table(pa_table, writer_batch_size)
[rank6]:   File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
[rank6]:   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
[rank6]:     return self.f.write(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 122] Disk quota exceeded

[rank6]: During handling of the above exception, another exception occurred:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
[rank6]:     num_examples, num_bytes = writer.finalize()
[rank6]:                               ^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
[rank6]:     self.stream.close()
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
[rank6]:     return self.f.close()
[rank6]:            ^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 122] Disk quota exceeded

[rank6]: The above exception was the direct cause of the following exception:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
[rank6]:     main()
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
[rank6]:     train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
[rank6]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
[rank6]:     train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
[rank6]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
[rank6]:     builder_instance.download_and_prepare(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
[rank6]:     self._download_and_prepare(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
[rank6]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
[rank6]:     for job_id, done, content in self._prepare_split_single(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
[rank6]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank6]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1415 examples [00:00, 9349.59 examples/s]
Generating train split: 3122 examples [00:00, 10389.63 examples/s]
W0405 10:15:30.839000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1221 closing signal SIGTERM
W0405 10:15:30.840000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1222 closing signal SIGTERM
W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1223 closing signal SIGTERM
W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1224 closing signal SIGTERM
W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1225 closing signal SIGTERM
W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1226 closing signal SIGTERM
W0405 10:15:30.843000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1227 closing signal SIGTERM
E0405 10:15:32.009000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1220) of binary: /workspace/miniconda3/envs/specforge/bin/python3.11
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/hanrui/SpecForge/scripts/train_dflash.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-04-05_10:15:30
  host      : job-006ce80a7c47-20260302193512-7694985998-5ng4c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1220)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================