[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] 
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] *****************************************
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2025-05-03 19:17:10,547] torch.distributed.run: [WARNING] *****************************************
[2025-05-03 19:17:13] Experiment directory created at logs/nwm_cdit_m
[2025-05-03 19:17:27] CDiT Parameters: 1,011,959,456
[2025-05-03 19:17:28] Dataset contains 132,929 images
[2025-05-03 19:17:28] Training for 300 epochs...
[2025-05-03 19:17:28] Beginning epoch 0...
[2025-05-03 19:20:24] (step=0000100) Train Loss: 0.3427, Train Steps/Sec: 0.57, Samples/Sec: 27.26
[2025-05-03 19:21:10] (step=0000200) Train Loss: 0.2083, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:21:57] (step=0000300) Train Loss: 0.1963, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 19:22:45] (step=0000400) Train Loss: 0.1902, Train Steps/Sec: 2.10, Samples/Sec: 100.83
[2025-05-03 19:23:31] (step=0000500) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:24:18] (step=0000600) Train Loss: 0.1827, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 19:25:04] (step=0000700) Train Loss: 0.1773, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:25:51] (step=0000800) Train Loss: 0.1689, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:26:38] (step=0000900) Train Loss: 0.1784, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:27:25] (step=0001000) Train Loss: 0.1725, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 19:28:12] (step=0001100) Train Loss: 0.1645, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:28:58] (step=0001200) Train Loss: 0.1716, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 19:29:45] (step=0001300) Train Loss: 0.1750, Train Steps/Sec: 2.15, Samples/Sec: 103.04
[2025-05-03 19:30:32] (step=0001400) Train Loss: 0.1631, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 19:31:19] (step=0001500) Train Loss: 0.1667, Train Steps/Sec: 2.12, Samples/Sec: 101.82
[2025-05-03 19:32:06] (step=0001600) Train Loss: 0.1680, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:32:52] (step=0001700) Train Loss: 0.1665, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 19:33:39] (step=0001800) Train Loss: 0.1602, Train Steps/Sec: 2.15, Samples/Sec: 102.99
[2025-05-03 19:34:26] (step=0001900) Train Loss: 0.1718, Train Steps/Sec: 2.12, Samples/Sec: 101.97
[2025-05-03 19:35:12] (step=0002000) Train Loss: 0.1734, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 19:35:29] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 19:36:16] (step=0002100) Train Loss: 0.1608, Train Steps/Sec: 1.59, Samples/Sec: 76.15
[2025-05-03 19:37:02] (step=0002200) Train Loss: 0.1668, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:37:49] (step=0002300) Train Loss: 0.1628, Train Steps/Sec: 2.13, Samples/Sec: 102.43
[2025-05-03 19:38:36] (step=0002400) Train Loss: 0.1686, Train Steps/Sec: 2.13, Samples/Sec: 102.36
[2025-05-03 19:39:23] (step=0002500) Train Loss: 0.1595, Train Steps/Sec: 2.13, Samples/Sec: 102.36
[2025-05-03 19:40:09] (step=0002600) Train Loss: 0.1698, Train Steps/Sec: 2.14, Samples/Sec: 102.95
[2025-05-03 19:40:56] (step=0002700) Train Loss: 0.1662, Train Steps/Sec: 2.14, Samples/Sec: 102.55
[2025-05-03 19:41:43] (step=0002800) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:42:30] (step=0002900) Train Loss: 0.1673, Train Steps/Sec: 2.12, Samples/Sec: 101.75
[2025-05-03 19:43:17] (step=0003000) Train Loss: 0.1561, Train Steps/Sec: 2.15, Samples/Sec: 102.97
[2025-05-03 19:44:03] (step=0003100) Train Loss: 0.1615, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:44:50] (step=0003200) Train Loss: 0.1586, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 19:45:37] (step=0003300) Train Loss: 0.1537, Train Steps/Sec: 2.12, Samples/Sec: 101.82
[2025-05-03 19:46:24] (step=0003400) Train Loss: 0.1555, Train Steps/Sec: 2.14, Samples/Sec: 102.96
[2025-05-03 19:47:10] (step=0003500) Train Loss: 0.1598, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:47:57] (step=0003600) Train Loss: 0.1564, Train Steps/Sec: 2.14, Samples/Sec: 102.58
[2025-05-03 19:48:44] (step=0003700) Train Loss: 0.1616, Train Steps/Sec: 2.13, Samples/Sec: 102.32
[2025-05-03 19:49:31] (step=0003800) Train Loss: 0.1593, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:50:18] (step=0003900) Train Loss: 0.1575, Train Steps/Sec: 2.14, Samples/Sec: 102.94
[2025-05-03 19:51:04] (step=0004000) Train Loss: 0.1603, Train Steps/Sec: 2.13, Samples/Sec: 102.37
[2025-05-03 19:51:19] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 19:52:06] (step=0004100) Train Loss: 0.1566, Train Steps/Sec: 1.62, Samples/Sec: 77.61
[2025-05-03 19:52:53] (step=0004200) Train Loss: 0.1528, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:53:40] (step=0004300) Train Loss: 0.1591, Train Steps/Sec: 2.15, Samples/Sec: 102.97
[2025-05-03 19:54:27] (step=0004400) Train Loss: 0.1582, Train Steps/Sec: 2.14, Samples/Sec: 102.53
[2025-05-03 19:55:13] (step=0004500) Train Loss: 0.1539, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:56:00] (step=0004600) Train Loss: 0.1567, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 19:56:47] (step=0004700) Train Loss: 0.1534, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 19:57:33] (step=0004800) Train Loss: 0.1592, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 19:58:20] (step=0004900) Train Loss: 0.1558, Train Steps/Sec: 2.13, Samples/Sec: 102.47
[2025-05-03 19:59:07] (step=0005000) Train Loss: 0.1563, Train Steps/Sec: 2.12, Samples/Sec: 101.89
[2025-05-03 19:59:54] (step=0005100) Train Loss: 0.1567, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:00:41] (step=0005200) Train Loss: 0.1473, Train Steps/Sec: 2.15, Samples/Sec: 103.10
[2025-05-03 20:01:27] (step=0005300) Train Loss: 0.1503, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 20:02:14] (step=0005400) Train Loss: 0.1573, Train Steps/Sec: 2.13, Samples/Sec: 102.44
[2025-05-03 20:03:01] (step=0005500) Train Loss: 0.1503, Train Steps/Sec: 2.14, Samples/Sec: 102.49
[2025-05-03 20:03:48] (step=0005600) Train Loss: 0.1553, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:04:35] (step=0005700) Train Loss: 0.1517, Train Steps/Sec: 2.14, Samples/Sec: 102.55
[2025-05-03 20:05:21] (step=0005800) Train Loss: 0.1590, Train Steps/Sec: 2.13, Samples/Sec: 102.40
[2025-05-03 20:06:08] (step=0005900) Train Loss: 0.1487, Train Steps/Sec: 2.13, Samples/Sec: 102.44
[2025-05-03 20:06:55] (step=0006000) Train Loss: 0.1486, Train Steps/Sec: 2.14, Samples/Sec: 102.92
[2025-05-03 20:07:10] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 20:07:57] (step=0006100) Train Loss: 0.1519, Train Steps/Sec: 1.61, Samples/Sec: 77.30
[2025-05-03 20:08:44] (step=0006200) Train Loss: 0.1544, Train Steps/Sec: 2.15, Samples/Sec: 103.04
[2025-05-03 20:09:31] (step=0006300) Train Loss: 0.1520, Train Steps/Sec: 2.13, Samples/Sec: 102.01
[2025-05-03 20:10:17] (step=0006400) Train Loss: 0.1439, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:11:04] (step=0006500) Train Loss: 0.1527, Train Steps/Sec: 2.15, Samples/Sec: 103.01
[2025-05-03 20:11:51] (step=0006600) Train Loss: 0.1510, Train Steps/Sec: 2.13, Samples/Sec: 102.31
[2025-05-03 20:12:38] (step=0006700) Train Loss: 0.1495, Train Steps/Sec: 2.12, Samples/Sec: 101.83
[2025-05-03 20:13:25] (step=0006800) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 102.98
[2025-05-03 20:14:11] (step=0006900) Train Loss: 0.1505, Train Steps/Sec: 2.14, Samples/Sec: 102.89
[2025-05-03 20:14:58] (step=0007000) Train Loss: 0.1450, Train Steps/Sec: 2.13, Samples/Sec: 102.45
[2025-05-03 20:15:45] (step=0007100) Train Loss: 0.1522, Train Steps/Sec: 2.15, Samples/Sec: 103.02
[2025-05-03 20:16:32] (step=0007200) Train Loss: 0.1496, Train Steps/Sec: 2.12, Samples/Sec: 101.90
[2025-05-03 20:17:18] (step=0007300) Train Loss: 0.1483, Train Steps/Sec: 2.15, Samples/Sec: 103.08
[2025-05-03 20:18:05] (step=0007400) Train Loss: 0.1457, Train Steps/Sec: 2.14, Samples/Sec: 102.48
[2025-05-03 20:18:52] (step=0007500) Train Loss: 0.1514, Train Steps/Sec: 2.15, Samples/Sec: 103.07
[2025-05-03 20:19:39] (step=0007600) Train Loss: 0.1475, Train Steps/Sec: 2.12, Samples/Sec: 101.98
[2025-05-03 20:20:25] (step=0007700) Train Loss: 0.1506, Train Steps/Sec: 2.15, Samples/Sec: 103.07
[2025-05-03 20:21:12] (step=0007800) Train Loss: 0.1528, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 20:21:59] (step=0007900) Train Loss: 0.1442, Train Steps/Sec: 2.15, Samples/Sec: 103.03
[2025-05-03 20:22:46] (step=0008000) Train Loss: 0.1514, Train Steps/Sec: 2.12, Samples/Sec: 101.91
[2025-05-03 20:23:01] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
[2025-05-03 20:23:47] (step=0008100) Train Loss: 0.1502, Train Steps/Sec: 1.62, Samples/Sec: 77.90
[2025-05-03 20:24:34] (step=0008200) Train Loss: 0.1422, Train Steps/Sec: 2.15, Samples/Sec: 103.09
[2025-05-03 20:25:21] (step=0008300) Train Loss: 0.1492, Train Steps/Sec: 2.14, Samples/Sec: 102.51
[2025-05-03 20:26:08] (step=0008400) Train Loss: 0.1483, Train Steps/Sec: 2.12, Samples/Sec: 101.88
[2025-05-03 20:26:55] (step=0008500) Train Loss: 0.1516, Train Steps/Sec: 2.14, Samples/Sec: 102.96
[2025-05-03 20:27:41] (step=0008600) Train Loss: 0.1456, Train Steps/Sec: 2.15, Samples/Sec: 103.13
[2025-05-03 20:28:28] (step=0008700) Train Loss: 0.1442, Train Steps/Sec: 2.13, Samples/Sec: 102.47
[2025-05-03 20:29:15] (step=0008800) Train Loss: 0.1426, Train Steps/Sec: 2.13, Samples/Sec: 102.42
[2025-05-03 20:30:02] (step=0008900) Train Loss: 0.1527, Train Steps/Sec: 2.14, Samples/Sec: 102.51
[2025-05-03 20:30:48] (step=0009000) Train Loss: 0.1414, Train Steps/Sec: 2.15, Samples/Sec: 103.05
[2025-05-03 20:31:35] (step=0009100) Train Loss: 0.1405, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 20:32:22] (step=0009200) Train Loss: 0.1449, Train Steps/Sec: 2.14, Samples/Sec: 102.53
[2025-05-03 20:33:09] (step=0009300) Train Loss: 0.1420, Train Steps/Sec: 2.13, Samples/Sec: 102.41
[2025-05-03 20:33:55] (step=0009400) Train Loss: 0.1454, Train Steps/Sec: 2.15, Samples/Sec: 103.00
[2025-05-03 20:34:42] (step=0009500) Train Loss: 0.1462, Train Steps/Sec: 2.14, Samples/Sec: 102.50
[2025-05-03 20:35:29] (step=0009600) Train Loss: 0.1490, Train Steps/Sec: 2.14, Samples/Sec: 102.90
[2025-05-03 20:36:16] (step=0009700) Train Loss: 0.1443, Train Steps/Sec: 2.12, Samples/Sec: 101.84
[2025-05-03 20:37:03] (step=0009800) Train Loss: 0.1417, Train Steps/Sec: 2.14, Samples/Sec: 102.87
[2025-05-03 20:37:50] (step=0009900) Train Loss: 0.1448, Train Steps/Sec: 2.13, Samples/Sec: 102.34
[2025-05-03 20:38:36] (step=0010000) Train Loss: 0.1431, Train Steps/Sec: 2.15, Samples/Sec: 103.01
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
[2025-05-03 20:38:52] Saved checkpoint to logs/nwm_cdit_m/checkpoints/latest.pth.tar
Downloading: "https://github.com/facebookresearch/dino/zipball/main" to ./models/main.zip
Traceback (most recent call last):
  File "train.py", line 437, in <module>
    main(args)
  File "train.py", line 352, in main
    sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 384, in evaluate
    eval_model, _ = dreamsim(pretrained=True)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim
    ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__
    ViTExtractor(model_type, stride, load_dir, device=device)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__
    self.model = ViTExtractor.create_model(model_type, load_dir)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model
    model = torch.hub.load('facebookresearch/dino:main', model_type)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload
    download_url_to_file(url, cached_file, progress=False)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file
    u = urlopen(req)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Traceback (most recent call last):
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 820, in create_connection
    raise err
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/socket.py", line 808, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 437, in <module>
    main(args)
  File "train.py", line 352, in main
    sim_score = evaluate(ema, tokenizer, diffusion, test_dataset, rank, config["batch_size"], config["num_workers"], latent_size, device, save_dir, args.global_seed, bfloat_enable, num_cond)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 384, in evaluate
    eval_model, _ = dreamsim(pretrained=True)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 275, in dreamsim
    ours_model = PerceptualModel(**dreamsim_args['model_config'][dreamsim_type], device=device, load_dir=cache_dir,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/model.py", line 65, in __init__
    ViTExtractor(model_type, stride, load_dir, device=device)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 44, in __init__
    self.model = ViTExtractor.create_model(model_type, load_dir)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/dreamsim/feature_extraction/extractor.py", line 72, in create_model
    model = torch.hub.load('facebookresearch/dino:main', model_type)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 563, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 238, in _get_cache_or_reload
    download_url_to_file(url, cached_file, progress=False)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/hub.py", line 620, in download_url_to_file
    u = urlopen(req)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
[2025-05-03 20:41:14,779] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715331 closing signal SIGTERM
[2025-05-03 20:41:14,780] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 715440 closing signal SIGTERM
[2025-05-03 20:41:14,944] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 715461) of binary: /data1/zwc/miniconda3/envs/nwm2/bin/python
Traceback (most recent call last):
  File "/data1/tpz/anaconda3/envs/nwm2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/zwc/miniconda3/envs/nwm2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-03_20:41:14
  host      : localhost
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 715461)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================