Hanrui / progress /log /dflash_hf_0405_1013.log

	nohup: ignoring input
	W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803]
	W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
	W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
	W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	INFO:specforge.utils:rank 0: bind to device 0
	INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 0: Initialized distributed
	INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
	`torch_dtype` is deprecated! Use `dtype` instead!
	INFO:specforge.utils:rank 3: bind to device 3
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 4: bind to device 4
	INFO:specforge.utils:rank 2: bind to device 2
	INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 3: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 4: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 2: Initialized distributed
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 5: bind to device 5
	INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 5: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 7: bind to device 7
	INFO:specforge.utils:rank 1: bind to device 1
	INFO:specforge.utils:rank 6: bind to device 6
	INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 7: Initialized distributed
	INFO:specforge.utils:rank 1: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	`torch_dtype` is deprecated! Use `dtype` instead!
	INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 6: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.33it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.36it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.35it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.36it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.34it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00, 2.37it/s]
	INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json
	INFO:specforge.utils:Using attention backend: flex_attention
	INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36
	INFO:specforge.utils:Draft model parameters: 1,048,626,432
	INFO:specforge.utils:Using mask_token_id: 151669
	INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]}
	Generating train split: 17694 examples [00:01, 9631.51 examples/s]
	[rank0]: Traceback (most recent call last):
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
	[rank0]: writer.write_table(table)
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
	[rank0]: self.pa_writer.write_table(pa_table, writer_batch_size)
	[rank0]: File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
	[rank0]: File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
	[rank0]: return self.f.write(*args, **kwargs)
	[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	[rank0]: OSError: [Errno 122] Disk quota exceeded

	[rank0]: During handling of the above exception, another exception occurred:

	[rank0]: Traceback (most recent call last):
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
	[rank0]: num_examples, num_bytes = writer.finalize()
	[rank0]: ^^^^^^^^^^^^^^^^^
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
	[rank0]: self.stream.close()
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
	[rank0]: return self.f.close()
	[rank0]: ^^^^^^^^^^^^^^
	[rank0]: OSError: [Errno 122] Disk quota exceeded

	[rank0]: The above exception was the direct cause of the following exception:

	[rank0]: Traceback (most recent call last):
	[rank0]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
	[rank0]: main()
	[rank0]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
	[rank0]: train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
	[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	[rank0]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
	[rank0]: train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
	[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
	[rank0]: builder_instance.download_and_prepare(
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
	[rank0]: self._download_and_prepare(
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
	[rank0]: self._prepare_split(split_generator, **prepare_split_kwargs)
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
	[rank0]: for job_id, done, content in self._prepare_split_single(
	[rank0]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
	[rank0]: raise DatasetGenerationError("An error occurred while generating the dataset") from e
	[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
	[rank0]:[W405 10:15:29.947786516 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
	Generating train split: 17694 examples [00:01, 10199.12 examples/s]
	[rank6]: Traceback (most recent call last):
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
	[rank6]: writer.write_table(table)
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
	[rank6]: self.pa_writer.write_table(pa_table, writer_batch_size)
	[rank6]: File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
	[rank6]: File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
	[rank6]: return self.f.write(*args, **kwargs)
	[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	[rank6]: OSError: [Errno 122] Disk quota exceeded

	[rank6]: During handling of the above exception, another exception occurred:

	[rank6]: Traceback (most recent call last):
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
	[rank6]: num_examples, num_bytes = writer.finalize()
	[rank6]: ^^^^^^^^^^^^^^^^^
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
	[rank6]: self.stream.close()
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
	[rank6]: return self.f.close()
	[rank6]: ^^^^^^^^^^^^^^
	[rank6]: OSError: [Errno 122] Disk quota exceeded

	[rank6]: The above exception was the direct cause of the following exception:

	[rank6]: Traceback (most recent call last):
	[rank6]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
	[rank6]: main()
	[rank6]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
	[rank6]: train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
	[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	[rank6]: File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
	[rank6]: train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
	[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
	[rank6]: builder_instance.download_and_prepare(
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
	[rank6]: self._download_and_prepare(
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
	[rank6]: self._prepare_split(split_generator, **prepare_split_kwargs)
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
	[rank6]: for job_id, done, content in self._prepare_split_single(
	[rank6]: File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
	[rank6]: raise DatasetGenerationError("An error occurred while generating the dataset") from e
	[rank6]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
	Generating train split: 3122 examples [00:00, 10389.63 examples/s]
	W0405 10:15:30.839000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1221 closing signal SIGTERM
	W0405 10:15:30.840000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1222 closing signal SIGTERM
	W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1223 closing signal SIGTERM
	W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1224 closing signal SIGTERM
	W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1225 closing signal SIGTERM
	W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1226 closing signal SIGTERM
	W0405 10:15:30.843000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1227 closing signal SIGTERM
	E0405 10:15:32.009000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1220) of binary: /workspace/miniconda3/envs/specforge/bin/python3.11
	Traceback (most recent call last):
	File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module>
	sys.exit(main())
	^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
	return f(*args, **kwargs)
	^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
	run(args)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
	elastic_launch(
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
	return launch_agent(self._config, self._entrypoint, list(args))
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
	raise ChildFailedError(
	torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
	============================================================
	/workspace/hanrui/SpecForge/scripts/train_dflash.py FAILED
	------------------------------------------------------------
	Failures:
	<NO_OTHER_FAILURES>
	------------------------------------------------------------
	Root Cause (first observed failure):
	[0]:
	time : 2026-04-05_10:15:30
	host : job-006ce80a7c47-20260302193512-7694985998-5ng4c
	rank : 0 (local_rank: 0)
	exitcode : 1 (pid: 1220)
	error_file: <N/A>
	traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
	============================================================