nohup: ignoring input
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] 
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0405 10:13:37.253000 1134 site-packages/torch/distributed/run.py:803] *****************************************
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
INFO:specforge.utils:rank 0: bind to device 0
INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 0: Initialized distributed
INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 3: bind to device 3
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 4: bind to device 4
INFO:specforge.utils:rank 2: bind to device 2
INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 3: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 4: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 2: Initialized distributed
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 5: bind to device 5
INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 5: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 7: bind to device 7
INFO:specforge.utils:rank 1: bind to device 1
INFO:specforge.utils:rank 6: bind to device 6
INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 7: Initialized distributed
INFO:specforge.utils:rank 1: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 6: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.37it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.41it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.44it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.31it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.27it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.41it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.29it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:00<00:01,  2.33it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.82it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.81it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.85it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.84it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.86it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.83it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.85it/s]
Loading checkpoint shards:  40%|████      | 2/5 [00:01<00:01,  1.82it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.79it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.78it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.80it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.80it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.80it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.81it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.79it/s]
Loading checkpoint shards:  60%|██████    | 3/5 [00:01<00:01,  1.79it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.86it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.87it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.87it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.87it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.87it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.86it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.88it/s]
Loading checkpoint shards:  80%|████████  | 4/5 [00:02<00:00,  1.87it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.33it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.34it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.36it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.35it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.36it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.34it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.34it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.37it/s]
INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json
INFO:specforge.utils:Using attention backend: flex_attention
INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36
INFO:specforge.utils:Draft model parameters: 1,048,626,432
INFO:specforge.utils:Using mask_token_id: 151669
INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]}

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1415 examples [00:00, 8811.58 examples/s]
Generating train split: 3122 examples [00:00, 11082.04 examples/s]
Generating train split: 4875 examples [00:00, 12069.85 examples/s]
Generating train split: 6641 examples [00:00, 13076.22 examples/s]
Generating train split: 8414 examples [00:00, 13857.65 examples/s]
Generating train split: 10156 examples [00:00, 14143.44 examples/s]
Generating train split: 11890 examples [00:00, 14012.42 examples/s]
Generating train split: 13682 examples [00:01, 9161.16 examples/s] 
Generating train split: 15433 examples [00:01, 10065.15 examples/s]
Generating train split: 16762 examples [00:01, 9910.32 examples/s] 
Generating train split: 17694 examples [00:01, 9631.51 examples/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
[rank0]:     writer.write_table(table)
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
[rank0]:     self.pa_writer.write_table(pa_table, writer_batch_size)
[rank0]:   File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
[rank0]:   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
[rank0]:     return self.f.write(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: OSError: [Errno 122] Disk quota exceeded

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
[rank0]:     num_examples, num_bytes = writer.finalize()
[rank0]:                               ^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
[rank0]:     self.stream.close()
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
[rank0]:     return self.f.close()
[rank0]:            ^^^^^^^^^^^^^^
[rank0]: OSError: [Errno 122] Disk quota exceeded

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
[rank0]:     train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
[rank0]:     train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
[rank0]:     builder_instance.download_and_prepare(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
[rank0]:     self._download_and_prepare(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
[rank0]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
[rank0]:     for job_id, done, content in self._prepare_split_single(
[rank0]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
[rank0]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank0]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1837 examples [00:00, 11719.94 examples/s]
Generating train split: 3552 examples [00:00, 11699.05 examples/s]
Generating train split: 5305 examples [00:00, 12818.69 examples/s]
Generating train split: 7092 examples [00:00, 13468.71 examples/s]
[rank0]:[W405 10:15:29.947786516 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Generating train split: 8810 examples [00:00, 13834.01 examples/s]
Generating train split: 10577 examples [00:00, 14054.90 examples/s]
Generating train split: 12339 examples [00:00, 12805.47 examples/s]
Generating train split: 13682 examples [00:01, 9900.64 examples/s] 
Generating train split: 14998 examples [00:01, 10026.49 examples/s]
Generating train split: 16762 examples [00:01, 8287.30 examples/s] 
Generating train split: 17694 examples [00:01, 7877.81 examples/s]
Generating train split: 17694 examples [00:01, 10199.12 examples/s]
[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1890, in _prepare_split_single
[rank6]:     writer.write_table(table)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 680, in write_table
[rank6]:     self.pa_writer.write_table(pa_table, writer_batch_size)
[rank6]:   File "pyarrow/ipc.pxi", line 616, in pyarrow.lib._CRecordBatchWriter.write_table
[rank6]:   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 473, in write
[rank6]:     return self.f.write(*args, **kwargs)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 122] Disk quota exceeded

[rank6]: During handling of the above exception, another exception occurred:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1911, in _prepare_split_single
[rank6]:     num_examples, num_bytes = writer.finalize()
[rank6]:                               ^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/arrow_writer.py", line 693, in finalize
[rank6]:     self.stream.close()
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/fsspec/implementations/local.py", line 491, in close
[rank6]:     return self.f.close()
[rank6]:            ^^^^^^^^^^^^^^
[rank6]: OSError: [Errno 122] Disk quota exceeded

[rank6]: The above exception was the direct cause of the following exception:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 562, in <module>
[rank6]:     main()
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 415, in main
[rank6]:     train_dataloader, eval_dataloader = build_dataloader(args, tokenizer)
[rank6]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/hanrui/SpecForge/scripts/train_dflash.py", line 222, in build_dataloader
[rank6]:     train_dataset = load_dataset("json", data_files=args.train_data_path)["train"]
[rank6]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/load.py", line 1505, in load_dataset
[rank6]:     builder_instance.download_and_prepare(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 884, in download_and_prepare
[rank6]:     self._download_and_prepare(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 947, in _download_and_prepare
[rank6]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1739, in _prepare_split
[rank6]:     for job_id, done, content in self._prepare_split_single(
[rank6]:   File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/datasets/builder.py", line 1922, in _prepare_split_single
[rank6]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e
[rank6]: datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1415 examples [00:00, 9349.59 examples/s]
Generating train split: 3122 examples [00:00, 10389.63 examples/s]
W0405 10:15:30.839000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1221 closing signal SIGTERM
W0405 10:15:30.840000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1222 closing signal SIGTERM
W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1223 closing signal SIGTERM
W0405 10:15:30.841000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1224 closing signal SIGTERM
W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1225 closing signal SIGTERM
W0405 10:15:30.842000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1226 closing signal SIGTERM
W0405 10:15:30.843000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1227 closing signal SIGTERM
E0405 10:15:32.009000 1134 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 1220) of binary: /workspace/miniconda3/envs/specforge/bin/python3.11
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/workspace/hanrui/SpecForge/scripts/train_dflash.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-04-05_10:15:30
  host      : job-006ce80a7c47-20260302193512-7694985998-5ng4c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1220)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================