nohup: ignoring input
W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] 
W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] *****************************************
W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] *****************************************
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
Set TORCH_CUDA_ARCH_LIST to 9.0
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
  warnings.warn(
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
INFO:specforge.utils:rank 7: bind to device 7
INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 7: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 5: bind to device 5
INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 5: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.85it/s]

Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 146.67it/s]
INFO:specforge.utils:rank 2: bind to device 2
INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 2: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.95it/s]
INFO:specforge.utils:rank 6: bind to device 6
INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 6: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 4: bind to device 4
INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 0: bind to device 0
INFO:specforge.utils:rank 4: Initialized distributed
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
`torch_dtype` is deprecated! Use `dtype` instead!
INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 0: Initialized distributed
INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 1: bind to device 1
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 1: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO:specforge.utils:rank 3: bind to device 3
INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
INFO:specforge.utils:rank 3: Initialized distributed
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.05it/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.71it/s]

Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.19it/s]

Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.49it/s]

Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.23it/s]
INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json
INFO:specforge.utils:Using attention backend: flex_attention
INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36
INFO:specforge.utils:Draft model parameters: 1,048,626,432
INFO:specforge.utils:Using mask_token_id: 151669
INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]}

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 1837 examples [00:00, 11687.06 examples/s]
Generating train split: 3552 examples [00:00, 12905.26 examples/s]
Generating train split: 5305 examples [00:00, 13685.64 examples/s]
Generating train split: 7092 examples [00:00, 14087.24 examples/s]
Generating train split: 8810 examples [00:00, 13875.88 examples/s]
Generating train split: 10577 examples [00:00, 14070.80 examples/s]
Generating train split: 12339 examples [00:00, 14338.59 examples/s]
Generating train split: 14119 examples [00:01, 14408.97 examples/s]
Generating train split: 15875 examples [00:01, 13772.55 examples/s]
Generating train split: 18146 examples [00:01, 13631.08 examples/s]
Generating train split: 19821 examples [00:01, 13593.93 examples/s]
Generating train split: 21639 examples [00:01, 14037.45 examples/s]
Generating train split: 23383 examples [00:01, 14050.69 examples/s]
Generating train split: 25099 examples [00:01, 14084.22 examples/s]
Generating train split: 26883 examples [00:01, 14187.82 examples/s]
Generating train split: 28585 examples [00:02, 13753.75 examples/s]
Generating train split: 30239 examples [00:02, 13572.18 examples/s]
Generating train split: 31983 examples [00:02, 13849.88 examples/s]
Generating train split: 33781 examples [00:02, 14154.00 examples/s]
Generating train split: 35574 examples [00:02, 14123.50 examples/s]
Generating train split: 37211 examples [00:02, 13928.80 examples/s]
Generating train split: 38849 examples [00:02, 13744.18 examples/s]
Generating train split: 40492 examples [00:02, 13641.21 examples/s]
Generating train split: 42163 examples [00:03, 13830.61 examples/s]
Generating train split: 43858 examples [00:03, 13117.61 examples/s]
Generating train split: 45529 examples [00:03, 13362.01 examples/s]
Generating train split: 47168 examples [00:03, 13406.32 examples/s]
Generating train split: 48845 examples [00:03, 13647.43 examples/s]
Generating train split: 50514 examples [00:03, 13685.47 examples/s]
Generating train split: 52177 examples [00:03, 13816.16 examples/s]
Generating train split: 53848 examples [00:03, 13338.72 examples/s]
Generating train split: 55490 examples [00:04, 13486.62 examples/s]
Generating train split: 57140 examples [00:04, 13073.50 examples/s]
Generating train split: 58765 examples [00:04, 13223.92 examples/s]
Generating train split: 60428 examples [00:04, 13284.92 examples/s]
Generating train split: 62103 examples [00:04, 13510.17 examples/s]
Generating train split: 63757 examples [00:04, 13534.27 examples/s]
Generating train split: 65373 examples [00:04, 13635.48 examples/s]
Generating train split: 67054 examples [00:04, 13778.71 examples/s]
Generating train split: 68728 examples [00:05, 13958.56 examples/s]
Generating train split: 70334 examples [00:05, 13449.56 examples/s]
Generating train split: 71933 examples [00:05, 13524.01 examples/s]
Generating train split: 73524 examples [00:05, 13487.56 examples/s]
Generating train split: 75146 examples [00:05, 13619.17 examples/s]
Generating train split: 76875 examples [00:05, 13802.52 examples/s]
Generating train split: 78532 examples [00:05, 13914.99 examples/s]
Generating train split: 78809 examples [00:05, 13690.41 examples/s]
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl

Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32):   0%|          | 0/78809 [00:00<?, ? examples/s]
W0405 10:41:54.414000 1866 site-packages/torch/distributed/elastic/agent/server/api.py:725] Received 15 death signal, shutting down workers
W0405 10:41:54.415000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1952 closing signal SIGTERM
W0405 10:41:54.613000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1953 closing signal SIGTERM
W0405 10:41:54.613000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1954 closing signal SIGTERM
W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1955 closing signal SIGTERM
W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1956 closing signal SIGTERM
W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1957 closing signal SIGTERM
W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1958 closing signal SIGTERM
W0405 10:41:54.615000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1959 closing signal SIGTERM
W0405 10:41:58.415000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1956 closing signal SIGTERM
W0405 10:41:58.416000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1957 closing signal SIGTERM
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 717, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
    time.sleep(monitor_interval)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1866 got signal: 15

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 284, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 726, in run
    self._shutdown(e.sigval)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 369, in _shutdown
    self._pcontext.close(death_sig)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 578, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 920, in _close
    handler.proc.wait(time_to_wait)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/subprocess.py", line 1264, in wait
    return self._wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/subprocess.py", line 2047, in _wait
    time.sleep(delay)
  File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1866 got signal: 15
terminate called without an active exception
Fatal Python error: Aborted

Thread 0x00007f5a9a0a0740 (most recent call first):
  <no Python frame>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 13)
examples/run_qwen3_8b_dflash_hf.sh: line 47:  1866 Aborted                 (core dumped) torchrun --standalone --nproc_per_node $NUM_GPUS $ROOT_DIR/scripts/train_dflash.py --target-model-path /workspace/models/Qwen3-8B --draft-config-path $ROOT_DIR/configs/qwen3-8b-dflash.json --train-data-path /workspace/hanrui/qwen3-8b_dflash_regen/sharegpt_train_regenerated.jsonl --output-dir $ROOT_DIR/outputs/qwen3-8b-dflash-hf --num-epochs 6 --batch-size 4 --learning-rate 6e-4 --warmup-ratio 0.04 --max-grad-norm 1.0 --max-length 3072 --chat-template qwen --attention-backend $ATTENTION_BACKEND --num-anchors 512 --loss-decay-gamma 7.0 --log-interval 50 --save-interval 1000 --report-to none --target-model-backend hf --block-size 16 --num-anchors 512