| nohup: ignoring input |
| W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] |
| W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] ***************************************** |
| W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
| W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] ***************************************** |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| INFO:specforge.utils:rank 0: bind to device 0 |
| INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 0: Initialized distributed |
| INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| INFO:specforge.utils:rank 3: bind to device 3 |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 4: bind to device 4 |
| INFO:specforge.utils:rank 2: bind to device 2 |
| INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 3: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 4: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 2: Initialized distributed |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 5: bind to device 5 |
| INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 5: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 7: bind to device 7 |
| INFO:specforge.utils:rank 1: bind to device 1 |
| INFO:specforge.utils:rank 6: bind to device 6 |
| INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 7: Initialized distributed |
| INFO:specforge.utils:rank 1: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 6: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.33it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.36it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.35it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.36it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s] |
| Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.37it/s] |
| INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json |
| INFO:specforge.utils:Using attention backend: flex_attention |
| INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36 |
| INFO:specforge.utils:Draft model parameters: 1,048,626,432 |
| INFO:specforge.utils:Using mask_token_id: 151669 |
| INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]} |
| Generating train split: 17694 examples [00:01, 9631.51 examples/s] |
| [rank0]: Traceback (most recent call last): |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single |
| [rank0]: writer.write_table(table) |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table |
| [rank0]: self.pa_writer.write_table(pa_table, writer_batch_size) |
| [rank0]: File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table |
| [rank0]: File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write |
| [rank0]: return self.f.write(*args, **kwargs) |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: OSError: [Errno 122] Disk quota exceeded |
| |
| [rank0]: During handling of the above exception, another exception occurred: |
| |
| [rank0]: Traceback (most recent call last): |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single |
| [rank0]: num_examples, num_bytes = writer.finalize() |
| [rank0]: ^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize |
| [rank0]: self.stream.close() |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close |
| [rank0]: return self.f.close() |
| [rank0]: ^^^^^^^^^^^^^^ |
| [rank0]: OSError: [Errno 122] Disk quota exceeded |
| |
| [rank0]: The above exception was the direct cause of the following exception: |
| |
| [rank0]: Traceback (most recent call last): |
| [rank0]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module> |
| [rank0]: main() |
| [rank0]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main |
| [rank0]: train_dataloader, eval_dataloader = build_dataloader(args, tokenizer) |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader |
| [rank0]: train_dataset = load_dataset("json", data_files=args.train_data_path)["train"] |
| [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset |
| [rank0]: builder_instance.download_and_prepare( |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare |
| [rank0]: self._download_and_prepare( |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare |
| [rank0]: self._prepare_split(split_generator, **prepare_split_kwargs) |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split |
| [rank0]: for job_id, done, content in self._prepare_split_single( |
| [rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single |
| [rank0]: raise DatasetGenerationError("An error occurred while generating the dataset") from e |
| [rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset |
| [rank0]:[W405 10:15:29.947786516 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) |
| Generating train split: 17694 examples [00:01, 10199.12 examples/s] |
| [rank6]: Traceback (most recent call last): |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single |
| [rank6]: writer.write_table(table) |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table |
| [rank6]: self.pa_writer.write_table(pa_table, writer_batch_size) |
| [rank6]: File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table |
| [rank6]: File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write |
| [rank6]: return self.f.write(*args, **kwargs) |
| [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank6]: OSError: [Errno 122] Disk quota exceeded |
| |
| [rank6]: During handling of the above exception, another exception occurred: |
| |
| [rank6]: Traceback (most recent call last): |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single |
| [rank6]: num_examples, num_bytes = writer.finalize() |
| [rank6]: ^^^^^^^^^^^^^^^^^ |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize |
| [rank6]: self.stream.close() |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close |
| [rank6]: return self.f.close() |
| [rank6]: ^^^^^^^^^^^^^^ |
| [rank6]: OSError: [Errno 122] Disk quota exceeded |
| |
| [rank6]: The above exception was the direct cause of the following exception: |
| |
| [rank6]: Traceback (most recent call last): |
| [rank6]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module> |
| [rank6]: main() |
| [rank6]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main |
| [rank6]: train_dataloader, eval_dataloader = build_dataloader(args, tokenizer) |
| [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank6]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader |
| [rank6]: train_dataset = load_dataset("json", data_files=args.train_data_path)["train"] |
| [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset |
| [rank6]: builder_instance.download_and_prepare( |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare |
| [rank6]: self._download_and_prepare( |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare |
| [rank6]: self._prepare_split(split_generator, **prepare_split_kwargs) |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split |
| [rank6]: for job_id, done, content in self._prepare_split_single( |
| [rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single |
| [rank6]: raise DatasetGenerationError("An error occurred while generating the dataset") from e |
| [rank6]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset |
| Generating train split: 3122 examples [00:00, 10389.63 examples/s] |
| W0405 10:15:30.839000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1221 closing signal SIGTERM |
| W0405 10:15:30.840000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1222 closing signal SIGTERM |
| W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1223 closing signal SIGTERM |
| W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1224 closing signal SIGTERM |
| W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1225 closing signal SIGTERM |
| W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1226 closing signal SIGTERM |
| W0405 10:15:30.843000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1227 closing signal SIGTERM |
| E0405 10:15:32.009000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1220) of binary: /workspace/miniconda3/envs/specforge/bin/python3.11 |
| Traceback (most recent call last): |
| File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module> |
| sys.exit(main()) |
| ^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper |
| return f(*args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main |
| run(args) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run |
| elastic_launch( |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ |
| return launch_agent(self._config, self._entrypoint, list(args)) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent |
| raise ChildFailedError( |
| torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| ============================================================ |
| /workspace/hanrui/SpecForge/scripts/train_dflash.py FAILED |
| ------------------------------------------------------------ |
| Failures: |
| <NO_OTHER_FAILURES> |
| ------------------------------------------------------------ |
| Root Cause (first observed failure): |
| [0]: |
| time : 2026-04-05_10:15:30 |
| host : job-006ce80a7c47-20260302193512-7694985998-5ng4c |
| rank : 0 (local_rank: 0) |
| exitcode : 1 (pid: 1220) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| ============================================================ |
| |