nohup: ignoring input
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803]
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
warnings.warn(
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
INFO:specforge.utils:rank 0: bind to device 0
INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 0: Initialized distributed
INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 3: bind to device 3
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 4: bind to device 4
INFO:specforge.utils:rank 2: bind to device 2
INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 3: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 4: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 2: Initialized distributed
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 5: bind to device 5
INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 5: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 7: bind to device 7
INFO:specforge.utils:rank 1: bind to device 1
INFO:specforge.utils:rank 6: bind to device 6
INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 7: Initialized distributed
INFO:specforge.utils:rank 1: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 6: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.33it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.36it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.35it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.36it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.37it/s]
INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json
INFO:specforge.utils:Using attention backend: flex_attention
INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36
INFO:specforge.utils:Draft model parameters: 1,048,626,432
INFO:specforge.utils:Using mask_token_id: 151669
INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]}
Generating train split: 17694 examples [00:01, 9631.51 examples/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
[rank0]:     writer.write_table(table)
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
[rank0]:     self.pa_writer.write_table(pa_table, writer_batch_size)
[rank0]:   File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
[rank0]:   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
[rank0]:     return self.f.write(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: OSError: [Errno 122] Disk quota exceeded
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
[rank0]:     num_examples, num_bytes = writer.finalize()
[rank0]:                               ^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
[rank0]:     self.stream.close()
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
[rank0]:     return self.f.close()
[rank0]:            ^^^^^^^^^^^^^^
[rank0]: OSError: [Errno 122] Disk quota exceeded
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
[rank0]:     train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
[rank0]:     train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
[rank0]:     builder_instance.download_and_prepare(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
[rank0]:     self._download_and_prepare(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
[rank0]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
[rank0]:     for job_id, done, content in self._prepare_split_single(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
[rank0]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
[rank0]:[W405 10:15:29.947786516 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Generating train split: 17694 examples [00:01, 10199.12 examples/s]
[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
[rank6]:     writer.write_table(table)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
[rank6]:     self.pa_writer.write_table(pa_table, writer_batch_size)
[rank6]:   File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
[rank6]:   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
[rank6]:     return self.f.write(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 122] Disk quota exceeded
[rank6]: During handling of the above exception, another exception occurred:
[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
[rank6]:     num_examples, num_bytes = writer.finalize()
[rank6]:                               ^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
[rank6]:     self.stream.close()
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
[rank6]:     return self.f.close()
[rank6]:            ^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 122] Disk quota exceeded
[rank6]: The above exception was the direct cause of the following exception:
[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
[rank6]:     main()
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
[rank6]:     train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
[rank6]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
[rank6]:     train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
[rank6]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
[rank6]:     builder_instance.download_and_prepare(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
[rank6]:     self._download_and_prepare(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
[rank6]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
[rank6]:     for job_id, done, content in self._prepare_split_single(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
[rank6]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank6]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Generating train split: 3122 examples [00:00, 10389.63 examples/s]
W0405 10:15:30.839000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1221 closing signal SIGTERM
W0405 10:15:30.840000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1222 closing signal SIGTERM
W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1223 closing signal SIGTERM
W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1224 closing signal SIGTERM
W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1225 closing signal SIGTERM
W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1226 closing signal SIGTERM
W0405 10:15:30.843000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1227 closing signal SIGTERM
E0405 10:15:32.009000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1220) of binary: /workspace/miniconda3/envs/specforge/bin/python3.11
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/hanrui/SpecForge/scripts/train_dflash.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-04-05_10:15:30
  host      : job-006ce80a7c47-20260302193512-7694985998-5ng4c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1220)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================