| nohup: ignoring input |
| W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] |
| W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] ***************************************** |
| W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
| W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] ***************************************** |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| Set TORCH_CUDA_ARCH_LIST to 9.0 |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| /workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend. |
| warnings.warn( |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. |
| <frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. |
| INFO:specforge.utils:rank 7: bind to device 7 |
| INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 7: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 5: bind to device 5 |
| INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 5: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.85it/s] |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 146.67it/s] |
| INFO:specforge.utils:rank 2: bind to device 2 |
| INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 2: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.95it/s] |
| INFO:specforge.utils:rank 6: bind to device 6 |
| INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 6: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| INFO:specforge.utils:rank 4: bind to device 4 |
| INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 0: bind to device 0 |
| INFO:specforge.utils:rank 4: Initialized distributed |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 0: Initialized distributed |
| INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 1: bind to device 1 |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 1: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
| INFO:specforge.utils:rank 3: bind to device 3 |
| INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1)) |
| INFO:specforge.utils:rank 3: Initialized distributed |
| `torch_dtype` is deprecated! Use `dtype` instead! |
| The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details. |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.05it/s] |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.71it/s] |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.19it/s] |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.49it/s] |
|
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.23it/s] |
| INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json |
| INFO:specforge.utils:Using attention backend: flex_attention |
| INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36 |
| INFO:specforge.utils:Draft model parameters: 1,048,626,432 |
| INFO:specforge.utils:Using mask_token_id: 151669 |
| INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]} |
|
Generating train split: 78809 examples [00:05, 13690.41 examples/s] |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
| dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl |
|
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s] |
| W0405 10:41:54.414000 1866 site-packages/torch/distributed/elastic/agent/server/api.py:725] Received 15 death signal, shutting down workers |
| W0405 10:41:54.415000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1952 closing signal SIGTERM |
| W0405 10:41:54.613000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1953 closing signal SIGTERM |
| W0405 10:41:54.613000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1954 closing signal SIGTERM |
| W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1955 closing signal SIGTERM |
| W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1956 closing signal SIGTERM |
| W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1957 closing signal SIGTERM |
| W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1958 closing signal SIGTERM |
| W0405 10:41:54.615000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1959 closing signal SIGTERM |
| W0405 10:41:58.415000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1956 closing signal SIGTERM |
| W0405 10:41:58.416000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1957 closing signal SIGTERM |
| Traceback (most recent call last): |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 717, in run |
| result = self._invoke_run(role) |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run |
| time.sleep(monitor_interval) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler |
| raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) |
| torch.distributed.elastic.multiprocessing.api.SignalException: Process 1866 got signal: 15 |
|
|
| During handling of the above exception, another exception occurred: |
|
|
| Traceback (most recent call last): |
| File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module> |
| sys.exit(main()) |
| ^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper |
| return f(*args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main |
| run(args) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run |
| elastic_launch( |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__ |
| return launch_agent(self._config, self._entrypoint, list(args)) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 284, in launch_agent |
| result = agent.run() |
| ^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper |
| result = f(*args, **kwargs) |
| ^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 726, in run |
| self._shutdown(e.sigval) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 369, in _shutdown |
| self._pcontext.close(death_sig) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 578, in close |
| self._close(death_sig=death_sig, timeout=timeout) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 920, in _close |
| handler.proc.wait(time_to_wait) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/subprocess.py", line 1264, in wait |
| return self._wait(timeout=timeout) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/subprocess.py", line 2047, in _wait |
| time.sleep(delay) |
| File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler |
| raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) |
| torch.distributed.elastic.multiprocessing.api.SignalException: Process 1866 got signal: 15 |
| terminate called without an active exception |
| Fatal Python error: Aborted |
|
|
| Thread 0x00007f5a9a0a0740 (most recent call first): |
| <no Python frame> |
|
|
| Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 13) |
| examples/run_qwen3_8b_dflash_hf.sh: line 47: 1866 Aborted (core dumped) torchrun --standalone --nproc_per_node $NUM_GPUS $ROOT_DIR/scripts/train_dflash.py --target-model-path /workspace/models/Qwen3-8B --draft-config-path $ROOT_DIR/configs/qwen3-8b-dflash.json --train-data-path /workspace/hanrui/qwen3-8b_dflash_regen/sharegpt_train_regenerated.jsonl --output-dir $ROOT_DIR/outputs/qwen3-8b-dflash-hf --num-epochs 6 --batch-size 4 --learning-rate 6e-4 --warmup-ratio 0.04 --max-grad-norm 1.0 --max-length 3072 --chat-template qwen --attention-backend $ATTENTION_BACKEND --num-anchors 512 --loss-decay-gamma 7.0 --log-interval 50 --save-interval 1000 --report-to none --target-model-backend hf --block-size 16 --num-anchors 512 |
|
|