Junyi42 committed
Commit 1357eb9 · verified · 1 Parent(s): 8c6e6d8

Upload checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_mse_only/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_mse_only

checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_mse_only/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_mse_only/wandb/offline-run-20251230_194213-vlm_gym_jigsaw_one_img_lr2e_5_mse_only-run0/files/output.log CHANGED
@@ -1,3 +1,176 @@
+ FullyShardedDataParallel(
+   (_fsdp_wrapped_module): Bagel(
+     (language_model): Qwen2ForCausalLM(
+       (model): Qwen2Model(
+         (embed_tokens): Embedding(152064, 3584)
+         (layers): ModuleList(
+           (0-27): 28 x FullyShardedDataParallel(
+             (_fsdp_wrapped_module): CheckpointWrapper(
+               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+                 (self_attn): PackedAttentionMoT(
+                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+                 )
+                 (mlp): Qwen2MLP(
+                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+                   (act_fn): SiLU()
+                 )
+                 (mlp_moe_gen): Qwen2MLP(
+                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+                   (act_fn): SiLU()
+                 )
+                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+               )
+             )
+           )
+         )
+         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+         (rotary_emb): Qwen2RotaryEmbedding()
+       )
+       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+     )
+     (time_embedder): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): TimestepEmbedder(
+         (mlp): Sequential(
+           (0): Linear(in_features=256, out_features=3584, bias=True)
+           (1): SiLU()
+           (2): Linear(in_features=3584, out_features=3584, bias=True)
+         )
+       )
+     )
+     (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
+     (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
+     (latent_pos_embed): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): PositionEmbedding()
+     )
+     (vit_model): SiglipVisionModel(
+       (vision_model): FullyShardedDataParallel(
+         (_fsdp_wrapped_module): SiglipVisionTransformer(
+           (embeddings): SiglipVisionEmbeddings(
+             (position_embedding): Embedding(4900, 1152)
+             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+           )
+           (encoder): SiglipEncoder(
+             (layers): ModuleList(
+               (0-25): 26 x FullyShardedDataParallel(
+                 (_fsdp_wrapped_module): CheckpointWrapper(
+                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
+                     (self_attn): SiglipFlashAttention2(
+                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                     )
+                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+                     (mlp): SiglipMLP(
+                       (activation_fn): PytorchGELUTanh()
+                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+                     )
+                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+                   )
+                 )
+               )
+             )
+           )
+           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+         )
+       )
+     )
+     (connector): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): CheckpointWrapper(
+         (_checkpoint_wrapped_module): MLPconnector(
+           (activation_fn): PytorchGELUTanh()
+           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+         )
+       )
+     )
+     (vit_pos_embed): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): PositionEmbedding()
+     )
+   )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ time_embedder._fsdp_wrapped_module._flat_param True
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
+ Preparing Dataset vlm_gym_jigsaw_mse_loss_only/vlm_gym_jigsaw_train
+ Preparing Dataset vlm_gym_jigsaw_mse_loss_only/vlm_gym_jigsaw_train
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -867,179 +1040,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2025-12-30 23:06:30] (step=0000856) Train Loss mse: 0.0393, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
  [2025-12-30 23:06:43] (step=0000857) Train Loss mse: 0.0311, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
  [2025-12-30 23:06:59] (step=0000858) Train Loss mse: 0.0477, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
- FullyShardedDataParallel(
-   (_fsdp_wrapped_module): Bagel(
-     (language_model): Qwen2ForCausalLM(
-       (model): Qwen2Model(
-         (embed_tokens): Embedding(152064, 3584)
-         (layers): ModuleList(
-           (0-27): 28 x FullyShardedDataParallel(
-             (_fsdp_wrapped_module): CheckpointWrapper(
-               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
-                 (self_attn): PackedAttentionMoT(
-                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
-                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
-                 )
-                 (mlp): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (mlp_moe_gen): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-               )
-             )
-           )
-         )
-         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
-         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-         (rotary_emb): Qwen2RotaryEmbedding()
-       )
-       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
-     )
-     (time_embedder): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): TimestepEmbedder(
-         (mlp): Sequential(
-           (0): Linear(in_features=256, out_features=3584, bias=True)
-           (1): SiLU()
-           (2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
-     (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
-     (latent_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-     (vit_model): SiglipVisionModel(
-       (vision_model): FullyShardedDataParallel(
-         (_fsdp_wrapped_module): SiglipVisionTransformer(
-           (embeddings): SiglipVisionEmbeddings(
-             (position_embedding): Embedding(4900, 1152)
-             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
-           )
-           (encoder): SiglipEncoder(
-             (layers): ModuleList(
-               (0-25): 26 x FullyShardedDataParallel(
-                 (_fsdp_wrapped_module): CheckpointWrapper(
-                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
-                     (self_attn): SiglipFlashAttention2(
-                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                     )
-                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                     (mlp): SiglipMLP(
-                       (activation_fn): PytorchGELUTanh()
-                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
-                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
-                     )
-                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                   )
-                 )
-               )
-             )
-           )
-           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-         )
-       )
-     )
-     (connector): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): CheckpointWrapper(
-         (_checkpoint_wrapped_module): MLPconnector(
-           (activation_fn): PytorchGELUTanh()
-           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
-           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vit_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-   )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- time_embedder._fsdp_wrapped_module._flat_param True
- latent_pos_embed._fsdp_wrapped_module._flat_param False
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_jigsaw_mse_loss_only/vlm_gym_jigsaw_train
- Preparing Dataset vlm_gym_jigsaw_mse_loss_only/vlm_gym_jigsaw_train
  [2025-12-30 23:07:15] (step=0000859) Train Loss mse: 0.0401, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
  [2025-12-30 23:07:28] (step=0000860) Train Loss mse: 0.0433, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
  [2025-12-30 23:07:39] (step=0000861) Train Loss mse: 0.0504, Train Loss ce: 0.0000, Train Steps/Sec: 0.09,
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_mse_only/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_mse_only/wandb/offline-run-20260102_214304-vlm_gym_jigsaw_one_img_lr2e_5_mse_only-run0/files/output.log CHANGED
@@ -869,6 +869,17 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-03 01:07:43] (step=0000858) Train Loss mse: 0.0470, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
  [2026-01-03 01:07:59] (step=0000859) Train Loss mse: 0.0402, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
  [2026-01-03 01:08:12] (step=0000860) Train Loss mse: 0.0432, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
+ [2026-01-03 01:08:24] (step=0000861) Train Loss mse: 0.0504, Train Loss ce: 0.0000, Train Steps/Sec: 0.09,
+ [2026-01-03 01:08:37] (step=0000862) Train Loss mse: 0.0421, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
+ [2026-01-03 01:08:51] (step=0000863) Train Loss mse: 0.0328, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
+ [2026-01-03 01:09:02] (step=0000864) Train Loss mse: 0.0444, Train Loss ce: 0.0000, Train Steps/Sec: 0.09,
+ [2026-01-03 01:09:16] (step=0000865) Train Loss mse: 0.0274, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
+ [2026-01-03 01:09:29] (step=0000866) Train Loss mse: 0.0303, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
+ [2026-01-03 01:09:45] (step=0000867) Train Loss mse: 0.0494, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
+ [2026-01-03 01:09:58] (step=0000868) Train Loss mse: 0.0309, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
+ [2026-01-03 01:10:14] (step=0000869) Train Loss mse: 0.0313, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
+ [2026-01-03 01:10:28] (step=0000870) Train Loss mse: 0.0386, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
+ [2026-01-03 01:10:44] (step=0000871) Train Loss mse: 0.0403, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
  FullyShardedDataParallel(
    (_fsdp_wrapped_module): Bagel(
      (language_model): Qwen2ForCausalLM(
@@ -1042,17 +1053,6 @@ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
  vit_pos_embed._fsdp_wrapped_module._flat_param False
  Preparing Dataset vlm_gym_jigsaw_mse_loss_only/vlm_gym_jigsaw_train
  Preparing Dataset vlm_gym_jigsaw_mse_loss_only/vlm_gym_jigsaw_train
- [2026-01-03 01:08:24] (step=0000861) Train Loss mse: 0.0504, Train Loss ce: 0.0000, Train Steps/Sec: 0.09,
- [2026-01-03 01:08:37] (step=0000862) Train Loss mse: 0.0421, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
- [2026-01-03 01:08:51] (step=0000863) Train Loss mse: 0.0328, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
- [2026-01-03 01:09:02] (step=0000864) Train Loss mse: 0.0444, Train Loss ce: 0.0000, Train Steps/Sec: 0.09,
- [2026-01-03 01:09:16] (step=0000865) Train Loss mse: 0.0274, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
- [2026-01-03 01:09:29] (step=0000866) Train Loss mse: 0.0303, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
- [2026-01-03 01:09:45] (step=0000867) Train Loss mse: 0.0494, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
- [2026-01-03 01:09:58] (step=0000868) Train Loss mse: 0.0309, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
- [2026-01-03 01:10:14] (step=0000869) Train Loss mse: 0.0313, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
- [2026-01-03 01:10:28] (step=0000870) Train Loss mse: 0.0386, Train Loss ce: 0.0000, Train Steps/Sec: 0.07,
- [2026-01-03 01:10:44] (step=0000871) Train Loss mse: 0.0403, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
  [2026-01-03 01:11:00] (step=0000872) Train Loss mse: 0.0435, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
  [2026-01-03 01:11:13] (step=0000873) Train Loss mse: 0.0390, Train Loss ce: 0.0000, Train Steps/Sec: 0.08,
  [2026-01-03 01:11:29] (step=0000874) Train Loss mse: 0.0406, Train Loss ce: 0.0000, Train Steps/Sec: 0.06,
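
Note: the block that moves in both diffs (the printed module tree followed by the `_flat_param True/False` list) is the kind of output a small debugging helper could emit. A minimal hypothetical sketch, assuming a PyTorch FSDP-wrapped model; the helper name `dump_fsdp_layout` is invented for illustration and is not part of this commit:

import torch.nn as nn

def dump_fsdp_layout(model: nn.Module) -> None:
    # Print the nested module tree, including the FullyShardedDataParallel
    # and CheckpointWrapper wrappers visible in the logs above.
    print(model)
    # List each FSDP flat parameter with its requires_grad flag; a frozen
    # submodule prints False, which would match the latent_pos_embed and
    # vit_pos_embed entries in the logs.
    for name, param in model.named_parameters():
        if "_flat_param" in name:
            print(name, param.requires_grad)

Under that reading, the two False entries mark the position-embedding flat parameters as frozen while all other parameter groups remain trainable.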