Junyi42 commited on
Commit 4f6e178 · verified · 1 Parent(s): 728c8ff

Upload checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins

checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/wandb/offline-run-20260129_223638-vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins-run0/files/config.yaml CHANGED
@@ -0,0 +1,456 @@
+ wandb_version: 1
+
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.11.10
+     cli_version: 0.23.1
+     framework: huggingface
+     huggingface_version: 4.49.0
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1769726198
+     t:
+       1:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       2:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       3:
+       - 4
+       - 13
+       - 14
+       - 37
+       - 42
+       - 61
+       4: 3.11.10
+       5: 0.23.1
+       6: 4.49.0
+       13: linux-x86_64
+     e:
+       ittdtbt5kh132nrytiu95slls6ihhpwy:
+         os: Linux-6.6.93+-x86_64-with-glibc2.35
+         python: CPython 3.11.10
+         started_at: '2026-01-29T22:36:38.281927Z'
+         args:
+         - --dataset_config_file
+         - ./data/configs/vlm_gym_patch_reassembly_alt_train_mseloss_only.yaml
+         - --eval_dataset_config_file
+         - ./data/configs/vlm_gym_patch_reassembly_alt_eval_mseloss_only.yaml
+         - --viz_dataset_config_file
+         - ./data/configs/vlm_gym_patch_reassembly_alt_eval_mseloss_only.yaml
+         - --inference_hash_file
+         - /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+         - --task_name
+         - patch_reassembly_v5
+         - --instructions_dir
+         - ./data/instructions
+         - --train_data_dir
+         - /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/train/
+         - --train_jsonl_path
+         - /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/train/
+         - --eval_data_dir
+         - /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/val/
+         - --eval_jsonl_path
+         - /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/val/
+         - --model_path
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --layer_module
+         - Qwen2MoTDecoderLayer
+         - --max_latent_size
+         - '64'
+         - --resume-from
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --finetune_from_hf
+         - 'True'
+         - --auto_resume
+         - 'False'
+         - --resume-model-only
+         - 'True'
+         - --finetune-from-ema
+         - 'True'
+         - --log_every
+         - '1'
+         - --lr
+         - 2e-5
+         - --warmup_steps
+         - '300'
+         - --lr_scheduler
+         - cosine
+         - --num_worker
+         - '1'
+         - --expected_num_tokens
+         - '30000'
+         - --max_num_tokens
+         - '30000'
+         - --max_num_tokens_per_sample
+         - '30000'
+         - --visual_und
+         - 'True'
+         - --save_every
+         - '5000'
+         - --total_steps
+         - '5000'
+         - --text_cond_dropout_prob
+         - '0.0'
+         - --vae_cond_dropout_prob
+         - '0.0'
+         - --vit_cond_dropout_prob
+         - '0.0'
+         - --ema
+         - '0.993'
+         - --checkpoint_dir
+         - /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins
+         - --wandb_project
+         - bagel
+         - --wandb_name
+         - vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins
+         - --wandb_dir
+         - /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins
+         - --wandb_offline
+         - 'True'
+         program: /home/clouduser/Code/Github/unified_world_model/train/pretrain_unified_navit.py
+         code_path: train/pretrain_unified_navit.py
+         code_path_local: train/pretrain_unified_navit.py
+         git:
+           remote_url: https://github.com/para-lost/unified_world_model
+           commit: 8d7b26b7e552fc87b592cf3be94d93be7aeca2a9
+         root: /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins
+         host: junyizhang-launch-new-226785934-1-0
+         executable: /opt/conda/bin/python3.11
+         cpu_count: 48
+         cpu_count_logical: 96
+         gpu_type: NVIDIA A100-SXM4-80GB
+         gpu_count: 8
+         disk:
+           /:
+             total: '1052461830144'
+             used: '179527671808'
+         memory:
+           total: '1437332606976'
+         gpu_nvidia:
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-f4aaac9b-3a87-794b-6e6c-15c16dbe16e0
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-2f859169-51c8-27b2-ce9a-fccc2476cd01
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-bfb6fecc-609c-c84a-a7c3-42cf7cb62146
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-bf9a144c-5481-d388-df94-ad3c5c62d0cc
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-6272c62b-809e-5ca4-0bab-bc5c95571384
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-fab44e96-66a8-de9f-ffb2-d15bd5745b62
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-6a70a600-5064-4828-f913-0823badd1507
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-771a3859-a52f-931c-62d6-00b3ae2b8c67
+         cuda_version: '12.2'
+     writer_id: ittdtbt5kh132nrytiu95slls6ihhpwy
+ visual_gen:
+   desc: null
+   value: true
+ visual_und:
+   desc: null
+   value: true
+ results_dir:
+   desc: null
+   value: results
+ checkpoint_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins
+ wandb_project:
+   desc: null
+   value: bagel
+ wandb_name:
+   desc: null
+   value: vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins
+ wandb_runid:
+   desc: null
+   value: '0'
+ wandb_resume:
+   desc: null
+   value: allow
+ wandb_offline:
+   desc: null
+   value: true
+ wandb_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins
+ global_seed:
+   desc: null
+   value: 4396
+ auto_resume:
+   desc: null
+   value: false
+ resume_from:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ resume_model_only:
+   desc: null
+   value: true
+ finetune_from_ema:
+   desc: null
+   value: true
+ finetune_from_hf:
+   desc: null
+   value: true
+ log_every:
+   desc: null
+   value: 1
+ save_every:
+   desc: null
+   value: 5000
+ total_steps:
+   desc: null
+   value: 5000
+ warmup_steps:
+   desc: null
+   value: 300
+ lr_scheduler:
+   desc: null
+   value: cosine
+ lr:
+   desc: null
+   value: 2.0e-05
+ min_lr:
+   desc: null
+   value: 1.0e-07
+ beta1:
+   desc: null
+   value: 0.9
+ beta2:
+   desc: null
+   value: 0.95
+ eps:
+   desc: null
+   value: 1.0e-15
+ ema:
+   desc: null
+   value: 0.993
+ max_grad_norm:
+   desc: null
+   value: 1.0
+ timestep_shift:
+   desc: null
+   value: 1.0
+ mse_weight:
+   desc: null
+   value: 1.0
+ ce_weight:
+   desc: null
+   value: 1.0
+ ce_loss_reweighting:
+   desc: null
+   value: false
+ expected_num_tokens:
+   desc: null
+   value: 30000
+ num_replicate:
+   desc: null
+   value: 1
+ num_shard:
+   desc: null
+   value: 8
+ sharding_strategy:
+   desc: null
+   value: HYBRID_SHARD
+ backward_prefetch:
+   desc: null
+   value: BACKWARD_PRE
+ cpu_offload:
+   desc: null
+   value: false
+ freeze_llm:
+   desc: null
+   value: false
+ freeze_vit:
+   desc: null
+   value: false
+ freeze_vae:
+   desc: null
+   value: true
+ freeze_und:
+   desc: null
+   value: false
+ copy_init_moe:
+   desc: null
+   value: true
+ use_flex:
+   desc: null
+   value: false
+ eval_every:
+   desc: null
+   value: 500
+ num_eval_batches:
+   desc: null
+   value: 20
+ use_ema_for_eval:
+   desc: null
+   value: true
+ eval_log_dir:
+   desc: null
+   value: null
+ eval_run_tag:
+   desc: null
+   value: ''
+ viz_every:
+   desc: null
+   value: 500
+ viz_n:
+   desc: null
+   value: 8
+ viz_outdir:
+   desc: null
+   value: results/viz
+ eval_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_patch_reassembly_alt_eval_mseloss_only.yaml
+ viz_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_patch_reassembly_alt_eval_mseloss_only.yaml
+ eval_print_n:
+   desc: null
+   value: 3
+ save_ema_only:
+   desc: null
+   value: true
+ save_optimizer:
+   desc: null
+   value: false
+ model_path:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ llm_path:
+   desc: null
+   value: hf/Qwen2.5-0.5B-Instruct/
+ llm_qk_norm:
+   desc: null
+   value: true
+ tie_word_embeddings:
+   desc: null
+   value: false
+ layer_module:
+   desc: null
+   value: Qwen2MoTDecoderLayer
+ vae_path:
+   desc: null
+   value: flux/vae/ae.safetensors
+ vit_path:
+   desc: null
+   value: hf/siglip-so400m-14-980-flash-attn2-navit/
+ max_latent_size:
+   desc: null
+   value: 64
+ latent_patch_size:
+   desc: null
+   value: 2
+ vit_patch_size:
+   desc: null
+   value: 14
+ vit_max_num_patch_per_side:
+   desc: null
+   value: 70
+ connector_act:
+   desc: null
+   value: gelu_pytorch_tanh
+ interpolate_pos:
+   desc: null
+   value: false
+ vit_select_layer:
+   desc: null
+   value: -2
+ vit_rope:
+   desc: null
+   value: false
+ text_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ vae_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ vit_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_patch_reassembly_alt_train_mseloss_only.yaml
+ train_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/train/
+ train_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/train/
+ eval_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/val/
+ eval_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/patch_reassembly_alt_v5/val/
+ inference_hash_file:
+   desc: null
+   value: /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+ task_name:
+   desc: null
+   value: patch_reassembly_v5
+ instructions_dir:
+   desc: null
+   value: ./data/instructions
+ prefetch_factor:
+   desc: null
+   value: 2
+ num_workers:
+   desc: null
+   value: 1
+ max_num_tokens_per_sample:
+   desc: null
+   value: 30000
+ max_num_tokens:
+   desc: null
+   value: 30000
+ prefer_buffer_before:
+   desc: null
+   value: 16384
+ max_buffer_size:
+   desc: null
+   value: 50
+ data_seed:
+   desc: null
+   value: 42
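The config.yaml added above stores every hyperparameter under a `{desc, value}` wrapper, with run metadata tucked into the internal `_wandb` block. A minimal sketch of flattening such a dump back into a plain `{key: value}` dict (the helper name `load_run_config` is ours; PyYAML is assumed to be installed):

```python
import yaml  # PyYAML, assumed available


def load_run_config(text: str) -> dict:
    """Flatten a wandb config.yaml dump into {key: value}.

    Each top-level config key is stored as {desc: ..., value: ...};
    we keep only `value` and skip the internal `_wandb` block and
    scalar bookkeeping keys like `wandb_version`.
    """
    raw = yaml.safe_load(text)
    return {
        key: entry["value"]
        for key, entry in raw.items()
        if isinstance(entry, dict) and "value" in entry and key != "_wandb"
    }
```

With the file above, `load_run_config(...)["lr"]` would yield `2e-05` and `["total_steps"]` would yield `5000`.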
checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/wandb/offline-run-20260129_223638-vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins-run0/files/output.log CHANGED
@@ -782,16 +782,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-30 03:37:07] (step=0000771) Train Loss mse: 0.0162, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
  [2026-01-30 03:37:26] (step=0000772) Train Loss mse: 0.0173, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
  [2026-01-30 03:37:50] (step=0000773) Train Loss mse: 0.0135, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- [2026-01-30 03:38:14] (step=0000774) Train Loss mse: 0.0161, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- [2026-01-30 03:38:36] (step=0000775) Train Loss mse: 0.0152, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
- [2026-01-30 03:38:56] (step=0000776) Train Loss mse: 0.0159, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
- [2026-01-30 03:39:22] (step=0000777) Train Loss mse: 0.0153, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- [2026-01-30 03:39:43] (step=0000778) Train Loss mse: 0.0177, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
- [2026-01-30 03:40:01] (step=0000779) Train Loss mse: 0.0168, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
- [2026-01-30 03:40:21] (step=0000780) Train Loss mse: 0.0177, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
- [2026-01-30 03:40:48] (step=0000781) Train Loss mse: 0.0157, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- [2026-01-30 03:41:12] (step=0000782) Train Loss mse: 0.0180, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- [2026-01-30 03:41:35] (step=0000783) Train Loss mse: 0.0171, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
  FullyShardedDataParallel(
  (_fsdp_wrapped_module): Bagel(
  (language_model): Qwen2ForCausalLM(
@@ -978,6 +968,30 @@ Preparing Dataset vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce/vlm_gym_pa
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
  ce_avg: 0.0, mse_avg: 0.017283011227846146
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins_step1000
+ Preparing Dataset vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce/vlm_gym_patch_reassembly_alt_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
+ ce_avg: 0.0, mse_avg: 0.01552750263363123
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins_step1500
+ Preparing Dataset vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce/vlm_gym_patch_reassembly_alt_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
+ ce_avg: 0.0, mse_avg: 0.015110340900719166
+ [2026-01-30 03:38:14] (step=0000774) Train Loss mse: 0.0161, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
+ [2026-01-30 03:38:36] (step=0000775) Train Loss mse: 0.0152, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
+ [2026-01-30 03:38:56] (step=0000776) Train Loss mse: 0.0159, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
+ [2026-01-30 03:39:22] (step=0000777) Train Loss mse: 0.0153, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
+ [2026-01-30 03:39:43] (step=0000778) Train Loss mse: 0.0177, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
+ [2026-01-30 03:40:01] (step=0000779) Train Loss mse: 0.0168, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
+ [2026-01-30 03:40:21] (step=0000780) Train Loss mse: 0.0177, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
+ [2026-01-30 03:40:48] (step=0000781) Train Loss mse: 0.0157, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
+ [2026-01-30 03:41:12] (step=0000782) Train Loss mse: 0.0180, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
+ [2026-01-30 03:41:35] (step=0000783) Train Loss mse: 0.0171, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
  [2026-01-30 03:41:57] (step=0000784) Train Loss mse: 0.0161, Train Loss ce: 0.0000, Train Steps/Sec: 0.05,
  [2026-01-30 03:42:24] (step=0000785) Train Loss mse: 0.0168, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
  [2026-01-30 03:42:48] (step=0000786) Train Loss mse: 0.0156, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
@@ -1726,18 +1740,4 @@ ce_avg: 0.0, mse_avg: 0.017283011227846146
  [2026-01-30 08:23:09] (step=0001529) Train Loss mse: 0.0158, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
  [2026-01-30 08:23:31] (step=0001530) Train Loss mse: 0.0151, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
  [2026-01-30 08:23:54] (step=0001531) Train Loss mse: 0.0129, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- [2026-01-30 08:24:22] (step=0001532) Train Loss mse: 0.0126, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins_step1000
- Preparing Dataset vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce/vlm_gym_patch_reassembly_alt_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
- ce_avg: 0.0, mse_avg: 0.01552750263363123
- base_dir is /dev/shm/models/checkpoints_vlm_gym_patch_reassembly_alt_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_patch_reassembly_alt_one_img_lr2e_5_mse_only_ins_step1500
- Preparing Dataset vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce/vlm_gym_patch_reassembly_alt_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_patch_reassembly_alt_mse_loss_only_evalonce'}]
- ce_avg: 0.0, mse_avg: 0.015110340900719166
+ [2026-01-30 08:24:22] (step=0001532) Train Loss mse: 0.0126, Train Loss ce: 0.0000, Train Steps/Sec: 0.04,
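The training log lines in the diff above follow one fixed pattern, so step numbers and losses can be recovered for plotting. A minimal sketch (the helper name `parse_train_lines` is ours, and the regex assumes only the format shown in this log):

```python
import re

# Matches e.g. "(step=0000771) Train Loss mse: 0.0162, Train Loss ce: 0.0000"
LOG_RE = re.compile(
    r"\(step=(?P<step>\d+)\) Train Loss mse: (?P<mse>[\d.]+), "
    r"Train Loss ce: (?P<ce>[\d.]+)"
)


def parse_train_lines(lines):
    """Yield (step, mse_loss, ce_loss) from training-log lines; skip others."""
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            yield int(m.group("step")), float(m.group("mse")), float(m.group("ce"))
```

Non-matching lines (eval fingerprints, `base_dir is ...`, module dumps) are silently skipped, which keeps the parser usable on the raw `output.log`.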