2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_setup.py:_flush():81] Current SDK version is 0.24.1
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_setup.py:_flush():81] Configure stats pid to 601
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_setup.py:_flush():81] Loading settings from environment variables
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_init.py:setup_run_log_directory():717] Logging user logs to /workspace/hanrui/SpecForge-ext/wandb/run-20260202_071323-2yze80jn/logs/debug.log
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_init.py:setup_run_log_directory():718] Logging internal logs to /workspace/hanrui/SpecForge-ext/wandb/run-20260202_071323-2yze80jn/logs/debug-internal.log
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_init.py:init():844] calling init triggers
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_init.py:init():849] wandb.init called with sweep_config: {}
config: {'target_model_path': '/workspace/Qwen3-8B', 'trust_remote_code': False, 'draft_model_config': 'configs/qwen3-8b-qwen3eagle-5layer.json', 'embedding_key': 'model.embed_tokens.weight', 'lm_head_key': 'lm_head.weight', 'is_vlm': False, 'target_model_backend': 'sglang', 'train_data_path': '/workspace/hanrui/qwen3-8b_dflash_regen/sharegpt_train_regenerated.jsonl', 'train_hidden_states_path': None, 'eval_hidden_states_path': None, 'eval_data_path': None, 'chat_template': 'qwen', 'is_preformatted': False, 'train_only_last_turn': False, 'build_dataset_num_proc': 8, 'dataloader_num_workers': 4, 'num_epochs': 10, 'max_num_steps': None, 'batch_size': 2, 'learning_rate': 0.0001, 'max_length': 2048, 'warmup_ratio': 0.015, 'total_steps': 49260, 'max_grad_norm': 0.5, 'ttt_length': 7, 'resume': False, 'ckpt_dir': None, 'eval_interval': 5000, 'save_interval': 5000, 'log_interval': 100, 'seed': 0, 'draft_accumulation_steps': 1, 'tp_size': 1, 'sp_ulysses_size': 1, 'sp_ring_size': 1, 'attention_backend': 'flex_attention', 'cache_key': None, 'cache_dir': 'cache', 'output_dir': 'outputs/qwen3-8b-qwen3eagle-5layer', 'verbose': False, 'dist_timeout': 20, 'model_download_dir': None, 'min_pixels': 50176, 'max_pixels': 802816, 'profile': False, 'profile_start_step': 30, 'profile_num_steps': 4, 'profile_record_shapes': False, 'sglang_attention_backend': 'flashinfer', 'sglang_mem_fraction_static': 0.4, 'sglang_context_length': None, 'sglang_enable_nccl_nvls': False, 'sglang_enable_symm_mem': False, 'sglang_enable_torch_compile': False, 'sglang_enable_dp_attention': False, 'sglang_enable_dp_lm_head': False, 'sglang_enable_piecewise_cuda_graph': False, 'sglang_piecewise_cuda_graph_max_tokens': 4096, 'sglang_piecewise_cuda_graph_tokens': None, 'sglang_ep_size': 1, 'report_to': 'wandb', 'wandb_project': 'qwen3-8b-qwen3eagle', 'wandb_name': '5layer-ttt7', 'wandb_key': '<redacted>', 'swanlab_project': None, 
'swanlab_name': None, 'swanlab_key': None, 'mlflow_tracking_uri': None, 'mlflow_experiment_name': None, 'mlflow_run_name': None, 'dp_size': 8, 'target_batch_size': 2, '_wandb': {}}
2026-02-02 07:13:23,912 INFO MainThread:601 [wandb_init.py:init():892] starting backend
2026-02-02 07:13:24,247 INFO MainThread:601 [wandb_init.py:init():895] sending inform_init request
2026-02-02 07:13:24,263 INFO MainThread:601 [wandb_init.py:init():903] backend started and connected
2026-02-02 07:13:24,270 INFO MainThread:601 [wandb_init.py:init():973] updated telemetry
2026-02-02 07:13:24,285 INFO MainThread:601 [wandb_init.py:init():997] communicating run to backend with 90.0 second timeout
2026-02-02 07:13:55,052 INFO Thread-7 (wrapped_target):601 [retry.py:__call__():164] [no run ID] Retry attempt failed:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connection.py", line 204, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connection.py", line 759, in connect
self.sock = sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connection.py", line 213, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<HTTPSConnection(host='api.wandb.ai', port=443) at 0x7fcc1c1ea6d0>, 'Connection to api.wandb.ai timed out. (connect timeout=20)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/requests/adapters.py", line 644, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/util/retry.py", line 535, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<HTTPSConnection(host='api.wandb.ai', port=443) at 0x7fcc1c1ea6d0>, 'Connection to api.wandb.ai timed out. (connect timeout=20)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/wandb/sdk/lib/retry.py", line 157, in __call__
result = self._call_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/sdk/internal/internal_api.py", line 397, in execute
return self.client.execute(*args, **kwargs) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/sdk/lib/gql_request.py", line 70, in execute
request = self.session.post(self.url, **post_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/adapters.py", line 665, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<HTTPSConnection(host='api.wandb.ai', port=443) at 0x7fcc1c1ea6d0>, 'Connection to api.wandb.ai timed out. (connect timeout=20)'))
2026-02-02 07:14:12,432 INFO Thread-6 (wrapped_target):601 [retry.py:__call__():164] [no run ID] Retry attempt failed:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connection.py", line 204, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connection.py", line 759, in connect
self.sock = sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connection.py", line 213, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<HTTPSConnection(host='api.wandb.ai', port=443) at 0x7fcc1c1e8810>, 'Connection to api.wandb.ai timed out. (connect timeout=20)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/requests/adapters.py", line 644, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/urllib3/util/retry.py", line 535, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<HTTPSConnection(host='api.wandb.ai', port=443) at 0x7fcc1c1e8810>, 'Connection to api.wandb.ai timed out. (connect timeout=20)'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/specforge/lib/python3.11/site-packages/wandb/sdk/lib/retry.py", line 157, in __call__
result = self._call_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/sdk/internal/internal_api.py", line 397, in execute
return self.client.execute(*args, **kwargs) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/wandb/sdk/lib/gql_request.py", line 70, in execute
request = self.session.post(self.url, **post_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/specforge/lib/python3.11/site-packages/requests/adapters.py", line 665, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<HTTPSConnection(host='api.wandb.ai', port=443) at 0x7fcc1c1e8810>, 'Connection to api.wandb.ai timed out. (connect timeout=20)'))