Junyi42 committed on
Commit
77c6e91
·
verified ·
1 Parent(s): e3d786a

Upload checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins

checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/wandb/offline-run-20260119_052527-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins-run0/files/output.log CHANGED
@@ -1,3 +1,189 @@
1
  wandb: Detected [huggingface_hub.inference] in use.
2
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
3
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -864,192 +1050,21 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
864
  [2026-01-19 08:27:49] (step=0000853) Train Loss mse: 0.0668, Train Loss ce: 0.0769, Train Steps/Sec: 0.08,
865
  [2026-01-19 08:27:59] (step=0000854) Train Loss mse: 0.0867, Train Loss ce: 0.0813, Train Steps/Sec: 0.10,
866
  [2026-01-19 08:28:12] (step=0000855) Train Loss mse: 0.0717, Train Loss ce: 0.0791, Train Steps/Sec: 0.07,
867
- FullyShardedDataParallel(
868
- (_fsdp_wrapped_module): Bagel(
869
- (language_model): Qwen2ForCausalLM(
870
- (model): Qwen2Model(
871
- (embed_tokens): Embedding(152064, 3584)
872
- (layers): ModuleList(
873
- (0-27): 28 x FullyShardedDataParallel(
874
- (_fsdp_wrapped_module): CheckpointWrapper(
875
- (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
876
- (self_attn): PackedAttentionMoT(
877
- (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
878
- (k_proj): Linear(in_features=3584, out_features=512, bias=True)
879
- (v_proj): Linear(in_features=3584, out_features=512, bias=True)
880
- (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
881
- (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
882
- (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
883
- (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
884
- (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
885
- (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
886
- (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
887
- (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
888
- (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
889
- )
890
- (mlp): Qwen2MLP(
891
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
892
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
893
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
894
- (act_fn): SiLU()
895
- )
896
- (mlp_moe_gen): Qwen2MLP(
897
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
898
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
899
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
900
- (act_fn): SiLU()
901
- )
902
- (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
903
- (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
904
- (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
905
- (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
906
- )
907
- )
908
- )
909
- )
910
- (norm): Qwen2RMSNorm((3584,), eps=1e-06)
911
- (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
912
- (rotary_emb): Qwen2RotaryEmbedding()
913
- )
914
- (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
915
- )
916
- (time_embedder): FullyShardedDataParallel(
917
- (_fsdp_wrapped_module): TimestepEmbedder(
918
- (mlp): Sequential(
919
- (0): Linear(in_features=256, out_features=3584, bias=True)
920
- (1): SiLU()
921
- (2): Linear(in_features=3584, out_features=3584, bias=True)
922
- )
923
- )
924
- )
925
- (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
926
- (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
927
- (latent_pos_embed): FullyShardedDataParallel(
928
- (_fsdp_wrapped_module): PositionEmbedding()
929
- )
930
- (vit_model): SiglipVisionModel(
931
- (vision_model): FullyShardedDataParallel(
932
- (_fsdp_wrapped_module): SiglipVisionTransformer(
933
- (embeddings): SiglipVisionEmbeddings(
934
- (position_embedding): Embedding(4900, 1152)
935
- (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
936
- )
937
- (encoder): SiglipEncoder(
938
- (layers): ModuleList(
939
- (0-25): 26 x FullyShardedDataParallel(
940
- (_fsdp_wrapped_module): CheckpointWrapper(
941
- (_checkpoint_wrapped_module): SiglipEncoderLayer(
942
- (self_attn): SiglipFlashAttention2(
943
- (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
944
- (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
945
- (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
946
- (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
947
- )
948
- (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
949
- (mlp): SiglipMLP(
950
- (activation_fn): PytorchGELUTanh()
951
- (fc1): Linear(in_features=1152, out_features=4304, bias=True)
952
- (fc2): Linear(in_features=4304, out_features=1152, bias=True)
953
- )
954
- (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
955
- )
956
- )
957
- )
958
- )
959
- )
960
- (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
961
- )
962
- )
963
- )
964
- (connector): FullyShardedDataParallel(
965
- (_fsdp_wrapped_module): CheckpointWrapper(
966
- (_checkpoint_wrapped_module): MLPconnector(
967
- (activation_fn): PytorchGELUTanh()
968
- (fc1): Linear(in_features=1152, out_features=3584, bias=True)
969
- (fc2): Linear(in_features=3584, out_features=3584, bias=True)
970
- )
971
- )
972
- )
973
- (vit_pos_embed): FullyShardedDataParallel(
974
- (_fsdp_wrapped_module): PositionEmbedding()
975
- )
976
- )
977
- )
978
- _flat_param True
979
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
980
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
981
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
982
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
983
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
984
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
985
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
986
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
987
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
988
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
989
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
990
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
991
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
992
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
993
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
994
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
995
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
996
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
997
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
998
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
999
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1000
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1001
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1002
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1003
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1004
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1005
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1006
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1007
- time_embedder._fsdp_wrapped_module._flat_param True
1008
- latent_pos_embed._fsdp_wrapped_module._flat_param False
1009
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
1010
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1011
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1012
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1013
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1014
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1015
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1016
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1017
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1018
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1019
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1020
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1021
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1022
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1023
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1024
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1025
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1026
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1027
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1028
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1029
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1030
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1031
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1032
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1033
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1034
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1035
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1036
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1037
- vit_pos_embed._fsdp_wrapped_module._flat_param False
1038
- Preparing Dataset vlm_gym_jigsaw_swap_celoss/vlm_gym_jigsaw_swap_train
1039
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step0
1040
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1041
- [eval debug] first 3 batch fingerprints:
1042
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1043
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1044
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1045
- ce_avg: 1.0355833768844604, mse_avg: 0.09119202941656113
1046
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step500
1047
  Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1048
  [eval debug] first 3 batch fingerprints:
1049
  fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1050
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1051
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1052
- ce_avg: 0.08522484451532364, mse_avg: 0.07227052003145218
1053
  base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step1500
1054
  Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1055
  [eval debug] first 3 batch fingerprints:
@@ -1057,21 +1072,6 @@ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1057
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1058
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1059
  ce_avg: 0.14740489423274994, mse_avg: 0.06398878246545792
1060
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step2000
1061
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1062
- [eval debug] first 3 batch fingerprints:
1063
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1064
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1065
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1066
- ce_avg: 0.17242415249347687, mse_avg: 0.06874960660934448
1067
- [2026-01-19 08:28:24] (step=0000856) Train Loss mse: 0.0745, Train Loss ce: 0.0779, Train Steps/Sec: 0.09,
1068
- [2026-01-19 08:28:35] (step=0000857) Train Loss mse: 0.0827, Train Loss ce: 0.0701, Train Steps/Sec: 0.09,
1069
- [2026-01-19 08:28:48] (step=0000858) Train Loss mse: 0.0707, Train Loss ce: 0.0751, Train Steps/Sec: 0.08,
1070
- [2026-01-19 08:29:03] (step=0000859) Train Loss mse: 0.0528, Train Loss ce: 0.0762, Train Steps/Sec: 0.07,
1071
- [2026-01-19 08:29:14] (step=0000860) Train Loss mse: 0.0871, Train Loss ce: 0.0803, Train Steps/Sec: 0.09,
1072
- [2026-01-19 08:29:25] (step=0000861) Train Loss mse: 0.0612, Train Loss ce: 0.0715, Train Steps/Sec: 0.09,
1073
- [2026-01-19 08:29:36] (step=0000862) Train Loss mse: 0.0526, Train Loss ce: 0.0828, Train Steps/Sec: 0.09,
1074
- [2026-01-19 08:29:47] (step=0000863) Train Loss mse: 0.1003, Train Loss ce: 0.0734, Train Steps/Sec: 0.09,
1075
  [2026-01-19 08:29:58] (step=0000864) Train Loss mse: 0.0765, Train Loss ce: 0.0725, Train Steps/Sec: 0.09,
1076
  [2026-01-19 08:30:13] (step=0000865) Train Loss mse: 0.0489, Train Loss ce: 0.0778, Train Steps/Sec: 0.07,
1077
  [2026-01-19 08:30:23] (step=0000866) Train Loss mse: 0.0780, Train Loss ce: 0.0754, Train Steps/Sec: 0.09,
@@ -2296,6 +2296,20 @@ ce_avg: 0.17242415249347687, mse_avg: 0.06874960660934448
2296
  [2026-01-19 12:42:08] (step=0002085) Train Loss mse: 0.0849, Train Loss ce: 0.0714, Train Steps/Sec: 0.10,
2297
  [2026-01-19 12:42:18] (step=0002086) Train Loss mse: 0.0803, Train Loss ce: 0.0749, Train Steps/Sec: 0.10,
2298
  [2026-01-19 12:42:32] (step=0002087) Train Loss mse: 0.0616, Train Loss ce: 0.0706, Train Steps/Sec: 0.07,
2299
  [2026-01-19 12:42:44] (step=0002088) Train Loss mse: 0.0685, Train Loss ce: 0.0754, Train Steps/Sec: 0.08,
2300
  [2026-01-19 12:42:56] (step=0002089) Train Loss mse: 0.0967, Train Loss ce: 0.0695, Train Steps/Sec: 0.09,
2301
  [2026-01-19 12:43:07] (step=0002090) Train Loss mse: 0.0986, Train Loss ce: 0.0797, Train Steps/Sec: 0.09,
@@ -2408,20 +2422,6 @@ ce_avg: 0.17242415249347687, mse_avg: 0.06874960660934448
2408
  [2026-01-19 13:05:10] (step=0002197) Train Loss mse: 0.0664, Train Loss ce: 0.0750, Train Steps/Sec: 0.09,
2409
  [2026-01-19 13:05:20] (step=0002198) Train Loss mse: 0.0880, Train Loss ce: 0.0704, Train Steps/Sec: 0.09,
2410
  [2026-01-19 13:05:33] (step=0002199) Train Loss mse: 0.0675, Train Loss ce: 0.0757, Train Steps/Sec: 0.08,
2411
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step2500
2412
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
2413
- [eval debug] first 3 batch fingerprints:
2414
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2415
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2416
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2417
- ce_avg: 0.19842223823070526, mse_avg: 0.06336650252342224
2418
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step3000
2419
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
2420
- [eval debug] first 3 batch fingerprints:
2421
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2422
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2423
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2424
- ce_avg: 0.07113409787416458, mse_avg: 0.07119625806808472
2425
  [2026-01-19 13:05:44] (step=0002200) Train Loss mse: 0.0508, Train Loss ce: 0.0733, Train Steps/Sec: 0.09,
2426
  [2026-01-19 13:05:55] (step=0002201) Train Loss mse: 0.1194, Train Loss ce: 0.0703, Train Steps/Sec: 0.09,
2427
  [2026-01-19 13:06:08] (step=0002202) Train Loss mse: 0.0828, Train Loss ce: 0.0721, Train Steps/Sec: 0.08,
@@ -3187,6 +3187,20 @@ ce_avg: 0.07113409787416458, mse_avg: 0.07119625806808472
3187
  [2026-01-19 15:43:48] (step=0002959) Train Loss mse: 0.0704, Train Loss ce: 0.0697, Train Steps/Sec: 0.07,
3188
  [2026-01-19 15:44:02] (step=0002960) Train Loss mse: 0.0800, Train Loss ce: 0.0681, Train Steps/Sec: 0.07,
3189
  [2026-01-19 15:44:14] (step=0002961) Train Loss mse: 0.0970, Train Loss ce: 0.0651, Train Steps/Sec: 0.09,
3190
  [2026-01-19 15:44:23] (step=0002962) Train Loss mse: 0.0561, Train Loss ce: 0.0671, Train Steps/Sec: 0.10,
3191
  [2026-01-19 15:44:36] (step=0002963) Train Loss mse: 0.0818, Train Loss ce: 0.0704, Train Steps/Sec: 0.08,
3192
  [2026-01-19 15:44:44] (step=0002964) Train Loss mse: 0.0788, Train Loss ce: 0.0692, Train Steps/Sec: 0.11,
@@ -3475,20 +3489,6 @@ ce_avg: 0.07113409787416458, mse_avg: 0.07119625806808472
3475
  [2026-01-19 16:43:08] (step=0003247) Train Loss mse: 0.0538, Train Loss ce: 0.0729, Train Steps/Sec: 0.09,
3476
  [2026-01-19 16:43:18] (step=0003248) Train Loss mse: 0.1023, Train Loss ce: 0.0670, Train Steps/Sec: 0.10,
3477
  [2026-01-19 16:43:30] (step=0003249) Train Loss mse: 0.0763, Train Loss ce: 0.0710, Train Steps/Sec: 0.08,
3478
- [2026-01-19 16:43:41
3479
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step3500
3480
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
3481
- [eval debug] first 3 batch fingerprints:
3482
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3483
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3484
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3485
- ce_avg: 0.07160378992557526, mse_avg: 0.0721038281917572
3486
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step4000
3487
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
3488
- [eval debug] first 3 batch fingerprints:
3489
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3490
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3491
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3492
  [2026-01-19 16:43:41] (step=0003250) Train Loss mse: 0.0678, Train Loss ce: 0.0710, Train Steps/Sec: 0.09,
3493
  [2026-01-19 16:43:54] (step=0003251) Train Loss mse: 0.1078, Train Loss ce: 0.0676, Train Steps/Sec: 0.08,
3494
  [2026-01-19 16:44:07] (step=0003252) Train Loss mse: 0.0562, Train Loss ce: 0.0690, Train Steps/Sec: 0.07,
@@ -4369,6 +4369,20 @@ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
4369
  [2026-01-19 19:45:26] (step=0004127) Train Loss mse: 0.0530, Train Loss ce: 0.0686, Train Steps/Sec: 0.09,
4370
  [2026-01-19 19:45:37] (step=0004128) Train Loss mse: 0.0773, Train Loss ce: 0.0697, Train Steps/Sec: 0.09,
4371
  [2026-01-19 19:45:49] (step=0004129) Train Loss mse: 0.0693, Train Loss ce: 0.0689, Train Steps/Sec: 0.08,
4372
  [2026-01-19 19:45:59] (step=0004130) Train Loss mse: 0.0523, Train Loss ce: 0.0727, Train Steps/Sec: 0.10,
4373
  [2026-01-19 19:46:12] (step=0004131) Train Loss mse: 0.0772, Train Loss ce: 0.0679, Train Steps/Sec: 0.07,
4374
  [2026-01-19 19:46:26] (step=0004132) Train Loss mse: 0.0684, Train Loss ce: 0.0704, Train Steps/Sec: 0.07,
@@ -4719,20 +4733,6 @@ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
4719
  [2026-01-19 20:56:28] (step=0004477) Train Loss mse: 0.0562, Train Loss ce: 0.0708, Train Steps/Sec: 0.07,
4720
  [2026-01-19 20:56:42] (step=0004478) Train Loss mse: 0.0685, Train Loss ce: 0.0723, Train Steps/Sec: 0.07,
4721
  [2026-01-19 20:56:54] (step=0004479) Train Loss mse: 0.0507, Train Loss ce: 0.0680, Train Steps/Sec: 0.08,
4722
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step4500
4723
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
4724
- [eval debug] first 3 batch fingerprints:
4725
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4726
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4727
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4728
- ce_avg: 0.07053616642951965, mse_avg: 0.0635463297367096
4729
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step5000
4730
- Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
4731
- [eval debug] first 3 batch fingerprints:
4732
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4733
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4734
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4735
- ce_avg: 0.07069549709558487, mse_avg: 0.06882398575544357
4736
  [2026-01-19 20:57:09] (step=0004480) Train Loss mse: 0.0612, Train Loss ce: 0.0761, Train Steps/Sec: 0.07,
4737
  [2026-01-19 20:57:20] (step=0004481) Train Loss mse: 0.0472, Train Loss ce: 0.0673, Train Steps/Sec: 0.09,
4738
  [2026-01-19 20:57:34] (step=0004482) Train Loss mse: 0.0396, Train Loss ce: 0.0751, Train Steps/Sec: 0.07,
@@ -5255,4 +5255,11 @@ ce_avg: 0.07069549709558487, mse_avg: 0.06882398575544357
5255
  [2026-01-19 22:44:32] (step=0004999) Train Loss mse: 0.0512, Train Loss ce: 0.0650, Train Steps/Sec: 0.07,
5256
  [2026-01-19 22:45:38] (step=0005000) Train Loss mse: 0.0639, Train Loss ce: 0.0726, Train Steps/Sec: 0.02,
5257
  [2026-01-19 22:45:38] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/0005000.
5258
- [2026-01-19 22:48:11] Done!
1
+ FullyShardedDataParallel(
2
+ (_fsdp_wrapped_module): Bagel(
3
+ (language_model): Qwen2ForCausalLM(
4
+ (model): Qwen2Model(
5
+ (embed_tokens): Embedding(152064, 3584)
6
+ (layers): ModuleList(
7
+ (0-27): 28 x FullyShardedDataParallel(
8
+ (_fsdp_wrapped_module): CheckpointWrapper(
9
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
10
+ (self_attn): PackedAttentionMoT(
11
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
12
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
13
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
14
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
15
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
16
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
17
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
18
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
19
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
20
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
21
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
22
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
23
+ )
24
+ (mlp): Qwen2MLP(
25
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
26
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
27
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
28
+ (act_fn): SiLU()
29
+ )
30
+ (mlp_moe_gen): Qwen2MLP(
31
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
32
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
33
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
34
+ (act_fn): SiLU()
35
+ )
36
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
37
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
38
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
39
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
40
+ )
41
+ )
42
+ )
43
+ )
44
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
45
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
46
+ (rotary_emb): Qwen2RotaryEmbedding()
47
+ )
48
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
49
+ )
50
+ (time_embedder): FullyShardedDataParallel(
51
+ (_fsdp_wrapped_module): TimestepEmbedder(
52
+ (mlp): Sequential(
53
+ (0): Linear(in_features=256, out_features=3584, bias=True)
54
+ (1): SiLU()
55
+ (2): Linear(in_features=3584, out_features=3584, bias=True)
56
+ )
57
+ )
58
+ )
59
+ (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
60
+ (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
61
+ (latent_pos_embed): FullyShardedDataParallel(
62
+ (_fsdp_wrapped_module): PositionEmbedding()
63
+ )
64
+ (vit_model): SiglipVisionModel(
65
+ (vision_model): FullyShardedDataParallel(
66
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
67
+ (embeddings): SiglipVisionEmbeddings(
68
+ (position_embedding): Embedding(4900, 1152)
69
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
70
+ )
71
+ (encoder): SiglipEncoder(
72
+ (layers): ModuleList(
73
+ (0-25): 26 x FullyShardedDataParallel(
74
+ (_fsdp_wrapped_module): CheckpointWrapper(
75
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
76
+ (self_attn): SiglipFlashAttention2(
77
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
78
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
79
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
80
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
81
+ )
82
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
83
+ (mlp): SiglipMLP(
84
+ (activation_fn): PytorchGELUTanh()
85
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
86
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
87
+ )
88
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
89
+ )
90
+ )
91
+ )
92
+ )
93
+ )
94
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
95
+ )
96
+ )
97
+ )
98
+ (connector): FullyShardedDataParallel(
99
+ (_fsdp_wrapped_module): CheckpointWrapper(
100
+ (_checkpoint_wrapped_module): MLPconnector(
101
+ (activation_fn): PytorchGELUTanh()
102
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
103
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
104
+ )
105
+ )
106
+ )
107
+ (vit_pos_embed): FullyShardedDataParallel(
108
+ (_fsdp_wrapped_module): PositionEmbedding()
109
+ )
110
+ )
111
+ )
112
+ _flat_param True
113
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
114
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
115
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
116
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
117
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
118
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
119
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
120
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
121
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
122
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
123
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
124
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
125
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
126
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
127
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
128
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
129
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
130
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
131
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
132
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
133
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
134
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
135
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
136
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
137
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
138
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
139
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
140
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
141
+ time_embedder._fsdp_wrapped_module._flat_param True
142
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
143
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
144
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
145
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
146
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
147
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
148
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
149
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
150
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
151
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
152
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
153
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
154
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
155
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
156
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
157
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
158
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
159
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
160
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
161
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
162
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
163
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
164
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
165
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
166
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
167
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
168
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
169
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
170
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
171
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
172
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss/vlm_gym_jigsaw_swap_train
173
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step0
174
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
175
+ [eval debug] first 3 batch fingerprints:
176
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
177
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
178
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
179
+ ce_avg: 1.0355833768844604, mse_avg: 0.09119202941656113
180
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step500
181
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
182
+ [eval debug] first 3 batch fingerprints:
183
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
184
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
185
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
186
+ ce_avg: 0.08522484451532364, mse_avg: 0.07227052003145218
187
  wandb: Detected [huggingface_hub.inference] in use.
188
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
189
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 
1050
  [2026-01-19 08:27:49] (step=0000853) Train Loss mse: 0.0668, Train Loss ce: 0.0769, Train Steps/Sec: 0.08,
1051
  [2026-01-19 08:27:59] (step=0000854) Train Loss mse: 0.0867, Train Loss ce: 0.0813, Train Steps/Sec: 0.10,
1052
  [2026-01-19 08:28:12] (step=0000855) Train Loss mse: 0.0717, Train Loss ce: 0.0791, Train Steps/Sec: 0.07,
1053
+ [2026-01-19 08:28:24] (step=0000856) Train Loss mse: 0.0745, Train Loss ce: 0.0779, Train Steps/Sec: 0.09,
1054
+ [2026-01-19 08:28:35] (step=0000857) Train Loss mse: 0.0827, Train Loss ce: 0.0701, Train Steps/Sec: 0.09,
1055
+ [2026-01-19 08:28:48] (step=0000858) Train Loss mse: 0.0707, Train Loss ce: 0.0751, Train Steps/Sec: 0.08,
1056
+ [2026-01-19 08:29:03] (step=0000859) Train Loss mse: 0.0528, Train Loss ce: 0.0762, Train Steps/Sec: 0.07,
1057
+ [2026-01-19 08:29:14] (step=0000860) Train Loss mse: 0.0871, Train Loss ce: 0.0803, Train Steps/Sec: 0.09,
1058
+ [2026-01-19 08:29:25] (step=0000861) Train Loss mse: 0.0612, Train Loss ce: 0.0715, Train Steps/Sec: 0.09,
1059
+ [2026-01-19 08:29:36] (step=0000862) Train Loss mse: 0.0526, Train Loss ce: 0.0828, Train Steps/Sec: 0.09,
1060
+ [2026-01-19 08:29:47] (step=0000863) Train Loss mse: 0.1003, Train Loss ce: 0.0734, Train Steps/Sec: 0.09,
1061
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step1000
1062
  Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1063
  [eval debug] first 3 batch fingerprints:
1064
  fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1065
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1066
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1067
+ ce_avg: 0.11199185252189636, mse_avg: 0.06597896665334702
1068
  base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step1500
1069
  Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
1070
  [eval debug] first 3 batch fingerprints:
 
1072
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1073
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
1074
  ce_avg: 0.14740489423274994, mse_avg: 0.06398878246545792
1075
  [2026-01-19 08:29:58] (step=0000864) Train Loss mse: 0.0765, Train Loss ce: 0.0725, Train Steps/Sec: 0.09,
1076
  [2026-01-19 08:30:13] (step=0000865) Train Loss mse: 0.0489, Train Loss ce: 0.0778, Train Steps/Sec: 0.07,
1077
  [2026-01-19 08:30:23] (step=0000866) Train Loss mse: 0.0780, Train Loss ce: 0.0754, Train Steps/Sec: 0.09,
 
2296
  [2026-01-19 12:42:08] (step=0002085) Train Loss mse: 0.0849, Train Loss ce: 0.0714, Train Steps/Sec: 0.10,
2297
  [2026-01-19 12:42:18] (step=0002086) Train Loss mse: 0.0803, Train Loss ce: 0.0749, Train Steps/Sec: 0.10,
2298
  [2026-01-19 12:42:32] (step=0002087) Train Loss mse: 0.0616, Train Loss ce: 0.0706, Train Steps/Sec: 0.07,
2299
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step2000
2300
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
2301
+ [eval debug] first 3 batch fingerprints:
2302
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2303
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2304
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2305
+ ce_avg: 0.17242415249347687, mse_avg: 0.06874960660934448
2306
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step2500
2307
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
2308
+ [eval debug] first 3 batch fingerprints:
2309
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2310
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2311
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
2312
+ ce_avg: 0.19842223823070526, mse_avg: 0.06336650252342224
2313
  [2026-01-19 12:42:44] (step=0002088) Train Loss mse: 0.0685, Train Loss ce: 0.0754, Train Steps/Sec: 0.08,
2314
  [2026-01-19 12:42:56] (step=0002089) Train Loss mse: 0.0967, Train Loss ce: 0.0695, Train Steps/Sec: 0.09,
2315
  [2026-01-19 12:43:07] (step=0002090) Train Loss mse: 0.0986, Train Loss ce: 0.0797, Train Steps/Sec: 0.09,
 
2422
  [2026-01-19 13:05:10] (step=0002197) Train Loss mse: 0.0664, Train Loss ce: 0.0750, Train Steps/Sec: 0.09,
2423
  [2026-01-19 13:05:20] (step=0002198) Train Loss mse: 0.0880, Train Loss ce: 0.0704, Train Steps/Sec: 0.09,
2424
  [2026-01-19 13:05:33] (step=0002199) Train Loss mse: 0.0675, Train Loss ce: 0.0757, Train Steps/Sec: 0.08,
2425
  [2026-01-19 13:05:44] (step=0002200) Train Loss mse: 0.0508, Train Loss ce: 0.0733, Train Steps/Sec: 0.09,
2426
  [2026-01-19 13:05:55] (step=0002201) Train Loss mse: 0.1194, Train Loss ce: 0.0703, Train Steps/Sec: 0.09,
2427
  [2026-01-19 13:06:08] (step=0002202) Train Loss mse: 0.0828, Train Loss ce: 0.0721, Train Steps/Sec: 0.08,
 
3187
  [2026-01-19 15:43:48] (step=0002959) Train Loss mse: 0.0704, Train Loss ce: 0.0697, Train Steps/Sec: 0.07,
3188
  [2026-01-19 15:44:02] (step=0002960) Train Loss mse: 0.0800, Train Loss ce: 0.0681, Train Steps/Sec: 0.07,
3189
  [2026-01-19 15:44:14] (step=0002961) Train Loss mse: 0.0970, Train Loss ce: 0.0651, Train Steps/Sec: 0.09,
3190
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step3000
3191
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
3192
+ [eval debug] first 3 batch fingerprints:
3193
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3194
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3195
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3196
+ ce_avg: 0.07113409787416458, mse_avg: 0.07119625806808472
3197
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step3500
3198
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
3199
+ [eval debug] first 3 batch fingerprints:
3200
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3201
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3202
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
3203
+ ce_avg: 0.07160378992557526, mse_avg: 0.0721038281917572
3204
  [2026-01-19 15:44:23] (step=0002962) Train Loss mse: 0.0561, Train Loss ce: 0.0671, Train Steps/Sec: 0.10,
3205
  [2026-01-19 15:44:36] (step=0002963) Train Loss mse: 0.0818, Train Loss ce: 0.0704, Train Steps/Sec: 0.08,
3206
  [2026-01-19 15:44:44] (step=0002964) Train Loss mse: 0.0788, Train Loss ce: 0.0692, Train Steps/Sec: 0.11,
 
3489
  [2026-01-19 16:43:08] (step=0003247) Train Loss mse: 0.0538, Train Loss ce: 0.0729, Train Steps/Sec: 0.09,
3490
  [2026-01-19 16:43:18] (step=0003248) Train Loss mse: 0.1023, Train Loss ce: 0.0670, Train Steps/Sec: 0.10,
3491
  [2026-01-19 16:43:30] (step=0003249) Train Loss mse: 0.0763, Train Loss ce: 0.0710, Train Steps/Sec: 0.08,
3492
  [2026-01-19 16:43:41] (step=0003250) Train Loss mse: 0.0678, Train Loss ce: 0.0710, Train Steps/Sec: 0.09,
3493
  [2026-01-19 16:43:54] (step=0003251) Train Loss mse: 0.1078, Train Loss ce: 0.0676, Train Steps/Sec: 0.08,
3494
  [2026-01-19 16:44:07] (step=0003252) Train Loss mse: 0.0562, Train Loss ce: 0.0690, Train Steps/Sec: 0.07,
 
4369
  [2026-01-19 19:45:26] (step=0004127) Train Loss mse: 0.0530, Train Loss ce: 0.0686, Train Steps/Sec: 0.09,
4370
  [2026-01-19 19:45:37] (step=0004128) Train Loss mse: 0.0773, Train Loss ce: 0.0697, Train Steps/Sec: 0.09,
4371
  [2026-01-19 19:45:49] (step=0004129) Train Loss mse: 0.0693, Train Loss ce: 0.0689, Train Steps/Sec: 0.08,
4372
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step4000
4373
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
4374
+ [eval debug] first 3 batch fingerprints:
4375
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4376
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4377
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4378
+ ce_avg: 0.07089916616678238, mse_avg: 0.0642944946885109
4379
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step4500
4380
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
4381
+ [eval debug] first 3 batch fingerprints:
4382
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4383
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4384
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
4385
+ ce_avg: 0.07053616642951965, mse_avg: 0.0635463297367096
4386
  [2026-01-19 19:45:59] (step=0004130) Train Loss mse: 0.0523, Train Loss ce: 0.0727, Train Steps/Sec: 0.10,
4387
  [2026-01-19 19:46:12] (step=0004131) Train Loss mse: 0.0772, Train Loss ce: 0.0679, Train Steps/Sec: 0.07,
4388
  [2026-01-19 19:46:26] (step=0004132) Train Loss mse: 0.0684, Train Loss ce: 0.0704, Train Steps/Sec: 0.07,
 
4733
  [2026-01-19 20:56:28] (step=0004477) Train Loss mse: 0.0562, Train Loss ce: 0.0708, Train Steps/Sec: 0.07,
4734
  [2026-01-19 20:56:42] (step=0004478) Train Loss mse: 0.0685, Train Loss ce: 0.0723, Train Steps/Sec: 0.07,
4735
  [2026-01-19 20:56:54] (step=0004479) Train Loss mse: 0.0507, Train Loss ce: 0.0680, Train Steps/Sec: 0.08,
4736
  [2026-01-19 20:57:09] (step=0004480) Train Loss mse: 0.0612, Train Loss ce: 0.0761, Train Steps/Sec: 0.07,
4737
  [2026-01-19 20:57:20] (step=0004481) Train Loss mse: 0.0472, Train Loss ce: 0.0673, Train Steps/Sec: 0.09,
4738
  [2026-01-19 20:57:34] (step=0004482) Train Loss mse: 0.0396, Train Loss ce: 0.0751, Train Steps/Sec: 0.07,
 
5255
  [2026-01-19 22:44:32] (step=0004999) Train Loss mse: 0.0512, Train Loss ce: 0.0650, Train Steps/Sec: 0.07,
5256
  [2026-01-19 22:45:38] (step=0005000) Train Loss mse: 0.0639, Train Loss ce: 0.0726, Train Steps/Sec: 0.02,
5257
  [2026-01-19 22:45:38] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/0005000.
5258
+ [2026-01-19 22:48:11] Done!
5259
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step5000
5260
+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val
5261
+ [eval debug] first 3 batch fingerprints:
5262
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
5263
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
5264
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_swap_celoss_evalonce'}]
5265
+ ce_avg: 0.07069549709558487, mse_avg: 0.06882398575544357
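
The evaluation blocks in this log each pair a `step_tag is ..._stepN` line with a later `ce_avg: ..., mse_avg: ...` line. A minimal sketch for collecting those pairs into a per-step table follows. The function name `parse_eval_blocks` and the two regular expressions are illustrative assumptions, not part of the training code; they only rely on the line shapes visible above, and they ignore the diff markers and interleaved line numbers.

```python
import re

# Assumed line shapes, taken from the log above:
#   "... step_tag is <run_name>_step<N>"
#   "ce_avg: <float>, mse_avg: <float>"
STEP_TAG = re.compile(r"step_tag is \S*_step(\d+)")
EVAL_AVG = re.compile(r"ce_avg:\s*([0-9.eE+-]+),\s*mse_avg:\s*([0-9.eE+-]+)")


def parse_eval_blocks(lines):
    """Map eval step -> (ce_avg, mse_avg), pairing each step_tag with the next averages line."""
    results = {}
    step = None
    for line in lines:
        tag = STEP_TAG.search(line)
        if tag:
            step = int(tag.group(1))
            continue
        avg = EVAL_AVG.search(line)
        if avg and step is not None:
            results[step] = (float(avg.group(1)), float(avg.group(2)))
            step = None  # wait for the next step_tag before accepting another averages line
    return results


# Sample lines copied from the log; the leading "+ " diff marker is harmless.
sample = [
    "+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step0",
    "+ Preparing Dataset vlm_gym_jigsaw_swap_celoss_evalonce/vlm_gym_jigsaw_swap_val",
    "+ ce_avg: 1.0355833768844604, mse_avg: 0.09119202941656113",
    "+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ins_step500",
    "+ ce_avg: 0.08522484451532364, mse_avg: 0.07227052003145218",
]

evals = parse_eval_blocks(sample)
for step in sorted(evals):
    ce, mse = evals[step]
    print(f"step={step:5d}  ce_avg={ce:.4f}  mse_avg={mse:.4f}")
```

Run over the full log, this would make it easier to compare the eval `ce_avg` at each 500-step checkpoint against the nearby `Train Loss ce` values.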