Hanrui/progress/log/dflash_hf_0405_1025.log

	nohup: ignoring input
	W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803]
	W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] *****************************************
	W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
	W0405 10:25:31.190000 1866 site-packages/torch/distributed/run.py:803] *****************************************
	Set TORCH_CUDA_ARCH_LIST to 9.0
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	Set TORCH_CUDA_ARCH_LIST to 9.0
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	Set TORCH_CUDA_ARCH_LIST to 9.0
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	Set TORCH_CUDA_ARCH_LIST to 9.0
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	/workspace/hanrui/SpecForge/specforge/modeling/draft/llama3_eagle.py:29: UserWarning: flash_attn is not found, falling back to flex_attention. Please install flash_attn if you want to use the flash attention backend.
	warnings.warn(
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
	<frozen importlib._bootstrap_external>:1241: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
	INFO:specforge.utils:rank 7: bind to device 7
	INFO:specforge.utils:rank 7: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 7: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 5: bind to device 5
	INFO:specforge.utils:rank 5: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 5: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.85it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 146.67it/s]
	INFO:specforge.utils:rank 2: bind to device 2
	INFO:specforge.utils:rank 2: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 2: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.95it/s]
	INFO:specforge.utils:rank 6: bind to device 6
	INFO:specforge.utils:rank 6: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 6: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	INFO:specforge.utils:rank 4: bind to device 4
	INFO:specforge.utils:rank 4: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 0: bind to device 0
	INFO:specforge.utils:rank 4: Initialized distributed
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	`torch_dtype` is deprecated! Use `dtype` instead!
	INFO:specforge.utils:rank 0: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 0: Initialized distributed
	INFO:specforge.utils:Loading target model from /workspace/models/Qwen3-8B using hf backend
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 1: bind to device 1
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 1: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 1: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	INFO:specforge.utils:rank 3: bind to device 3
	INFO:specforge.utils:rank 3: device mesh: DeviceMesh((dp=8, tp=1), device: 'cuda', stride: (1, 1))
	INFO:specforge.utils:rank 3: Initialized distributed
	`torch_dtype` is deprecated! Use `dtype` instead!
	The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.05it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.71it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.19it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 147.49it/s]
	Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00, 144.23it/s]
	INFO:specforge.utils:Loaded draft config from /workspace/hanrui/SpecForge/configs/qwen3-8b-dflash.json
	INFO:specforge.utils:Using attention backend: flex_attention
	INFO:specforge.utils:Draft config: block_size=16, num_hidden_layers=5, num_target_layers=36
	INFO:specforge.utils:Draft model parameters: 1,048,626,432
	INFO:specforge.utils:Using mask_token_id: 151669
	INFO:specforge.utils:dflash_config: {'mask_token_id': 151669, 'target_layer_ids': [1, 9, 17, 25, 33]}
	Generating train split: 78809 examples [00:05, 13690.41 examples/s]
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	dataset is cached at ./cache/processed_dataset/1ca66c4ec30f16c9add30cc4fc5f1b5a.pkl
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	Map (num_proc=32): 0%| | 0/78809 [00:00<?, ? examples/s]
	W0405 10:41:54.414000 1866 site-packages/torch/distributed/elastic/agent/server/api.py:725] Received 15 death signal, shutting down workers
	W0405 10:41:54.415000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1952 closing signal SIGTERM
	W0405 10:41:54.613000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1953 closing signal SIGTERM
	W0405 10:41:54.613000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1954 closing signal SIGTERM
	W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1955 closing signal SIGTERM
	W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1956 closing signal SIGTERM
	W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1957 closing signal SIGTERM
	W0405 10:41:54.614000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1958 closing signal SIGTERM
	W0405 10:41:54.615000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1959 closing signal SIGTERM
	W0405 10:41:58.415000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1956 closing signal SIGTERM
	W0405 10:41:58.416000 1866 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 1957 closing signal SIGTERM
	Traceback (most recent call last):
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 717, in run
	result = self._invoke_run(role)
	^^^^^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
	time.sleep(monitor_interval)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler
	raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
	torch.distributed.elastic.multiprocessing.api.SignalException: Process 1866 got signal: 15

	During handling of the above exception, another exception occurred:

	Traceback (most recent call last):
	File "/workspace/miniconda3/envs/specforge/bin/torchrun", line 6, in <module>
	sys.exit(main())
	^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
	return f(*args, **kwargs)
	^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 936, in main
	run(args)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/run.py", line 927, in run
	elastic_launch(
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
	return launch_agent(self._config, self._entrypoint, list(args))
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 284, in launch_agent
	result = agent.run()
	^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
	result = f(*args, **kwargs)
	^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 726, in run
	self._shutdown(e.sigval)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 369, in _shutdown
	self._pcontext.close(death_sig)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 578, in close
	self._close(death_sig=death_sig, timeout=timeout)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 920, in _close
	handler.proc.wait(time_to_wait)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/subprocess.py", line 1264, in wait
	return self._wait(timeout=timeout)
	^^^^^^^^^^^^^^^^^^^^^^^^^^^
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/subprocess.py", line 2047, in _wait
	time.sleep(delay)
	File "/workspace/miniconda3/envs/specforge/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 85, in _terminate_process_handler
	raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
	torch.distributed.elastic.multiprocessing.api.SignalException: Process 1866 got signal: 15
	terminate called without an active exception
	Fatal Python error: Aborted

	Thread 0x00007f5a9a0a0740 (most recent call first):
	<no Python frame>

	Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 13)
	examples/run_qwen3_8b_dflash_hf.sh: line 47: 1866 Aborted (core dumped) torchrun --standalone --nproc_per_node $NUM_GPUS $ROOT_DIR/scripts/train_dflash.py --target-model-path /workspace/models/Qwen3-8B --draft-config-path $ROOT_DIR/configs/qwen3-8b-dflash.json --train-data-path /workspace/hanrui/qwen3-8b_dflash_regen/sharegpt_train_regenerated.jsonl --output-dir $ROOT_DIR/outputs/qwen3-8b-dflash-hf --num-epochs 6 --batch-size 4 --learning-rate 6e-4 --warmup-ratio 0.04 --max-grad-norm 1.0 --max-length 3072 --chat-template qwen --attention-backend $ATTENTION_BACKEND --num-anchors 512 --loss-decay-gamma 7.0 --log-interval 50 --save-interval 1000 --report-to none --target-model-backend hf --block-size 16 --num-anchors 512