Junyi42 committed
Commit cd6e0d0 · verified · 1 Parent(s): 2777511

Upload checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins
checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/wandb/offline-run-20260126_213949-checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins-run0/files/output.log CHANGED
@@ -1,3 +1,196 @@
 wandb: Detected [huggingface_hub.inference] in use.
 wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
 wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -1018,6 +1211,20 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 [2026-01-26 22:12:48] (step=0001007) Train Loss mse: 0.0116, Train Loss ce: 0.0712, Train Steps/Sec: 0.58,
 [2026-01-26 22:12:50] (step=0001008) Train Loss mse: 0.0130, Train Loss ce: 0.0336, Train Steps/Sec: 0.68,
 [2026-01-26 22:12:51] (step=0001009) Train Loss mse: 0.0141, Train Loss ce: 0.0598, Train Steps/Sec: 0.68,
 [2026-01-26 22:12:53] (step=0001010) Train Loss mse: 0.0162, Train Loss ce: 0.0454, Train Steps/Sec: 0.68,
 [2026-01-26 22:12:54] (step=0001011) Train Loss mse: 0.0109, Train Loss ce: 0.0345, Train Steps/Sec: 0.68,
 [2026-01-26 22:12:56] (step=0001012) Train Loss mse: 0.0095, Train Loss ce: 0.0584, Train Steps/Sec: 0.68,
@@ -1065,220 +1272,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 [2026-01-26 22:14:01] (step=0001054) Train Loss mse: 0.0175, Train Loss ce: 0.0570, Train Steps/Sec: 0.68,
 [2026-01-26 22:14:03] (step=0001055) Train Loss mse: 0.0202, Train Loss ce: 0.0187, Train Steps/Sec: 0.68,
 [2026-01-26 22:14:04] (step=0001056) Train Loss mse: 0.0158, Train Loss ce: 0.0599, Train Steps/Sec: 0.68,
- FullyShardedDataParallel(
- (_fsdp_wrapped_module): Bagel(
- (language_model): Qwen2ForCausalLM(
- (model): Qwen2Model(
- (embed_tokens): Embedding(152064, 3584)
- (layers): ModuleList(
- (0-27): 28 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
- (self_attn): PackedAttentionMoT(
- (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
- (k_proj): Linear(in_features=3584, out_features=512, bias=True)
- (v_proj): Linear(in_features=3584, out_features=512, bias=True)
- (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
- (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
- (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
- (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
- (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
- (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
- (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
- (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
- (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
- )
- (mlp): Qwen2MLP(
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
- (act_fn): SiLU()
- )
- (mlp_moe_gen): Qwen2MLP(
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
- (act_fn): SiLU()
- )
- (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
- (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
- (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- )
- )
- )
- )
- (norm): Qwen2RMSNorm((3584,), eps=1e-06)
- (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- (rotary_emb): Qwen2RotaryEmbedding()
- )
- (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
- )
- (time_embedder): FullyShardedDataParallel(
- (_fsdp_wrapped_module): TimestepEmbedder(
- (mlp): Sequential(
- (0): Linear(in_features=256, out_features=3584, bias=True)
- (1): SiLU()
- (2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
- (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
- (latent_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- (vit_model): SiglipVisionModel(
- (vision_model): FullyShardedDataParallel(
- (_fsdp_wrapped_module): SiglipVisionTransformer(
- (embeddings): SiglipVisionEmbeddings(
- (position_embedding): Embedding(4900, 1152)
- (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
- )
- (encoder): SiglipEncoder(
- (layers): ModuleList(
- (0-25): 26 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): SiglipEncoderLayer(
- (self_attn): SiglipFlashAttention2(
- (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
- )
- (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- (mlp): SiglipMLP(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=4304, bias=True)
- (fc2): Linear(in_features=4304, out_features=1152, bias=True)
- )
- (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- )
- )
- (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- (connector): FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): MLPconnector(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=3584, bias=True)
- (fc2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vit_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- time_embedder._fsdp_wrapped_module._flat_param True
- latent_pos_embed._fsdp_wrapped_module._flat_param False
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_match_equation_sos_celoss/vlm_gym_match_equation_sos_train
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step0
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 1.8181179761886597, mse_avg: 0.6639004349708557
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step500
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.06269639730453491, mse_avg: 0.0149351442232728
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step1000
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.06731808930635452, mse_avg: 0.011200404725968838
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step1500
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.1969330757856369, mse_avg: 0.010982552543282509
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step2000
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.24203190207481384, mse_avg: 0.011826912872493267
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step2500
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.20800399780273438, mse_avg: 0.011797062121331692
 [2026-01-26 22:14:06] (step=0001057) Train Loss mse: 0.0142, Train Loss ce: 0.0901, Train Steps/Sec: 0.68,
 [2026-01-26 22:14:07] (step=0001058) Train Loss mse: 0.0084, Train Loss ce: 0.0419, Train Steps/Sec: 0.67,
 [2026-01-26 22:14:09] (step=0001059) Train Loss mse: 0.0127, Train Loss ce: 0.0692, Train Steps/Sec: 0.56,
@@ -2707,6 +2700,27 @@ ce_avg: 0.20800399780273438, mse_avg: 0.011797062121331692
 [2026-01-26 22:52:32] (step=0002482) Train Loss mse: 0.0074, Train Loss ce: 0.0610, Train Steps/Sec: 0.58,
 [2026-01-26 22:52:33] (step=0002483) Train Loss mse: 0.0053, Train Loss ce: 0.0485, Train Steps/Sec: 0.68,
 [2026-01-26 22:52:35] (step=0002484) Train Loss mse: 0.0062, Train Loss ce: 0.0460, Train Steps/Sec: 0.57,
 [2026-01-26 22:52:37] (step=0002485) Train Loss mse: 0.0051, Train Loss ce: 0.0446, Train Steps/Sec: 0.69,
 [2026-01-26 22:52:38] (step=0002486) Train Loss mse: 0.0086, Train Loss ce: 0.0438, Train Steps/Sec: 0.57,
 [2026-01-26 22:52:41] (step=0002487) Train Loss mse: 0.0067, Train Loss ce: 0.0641, Train Steps/Sec: 0.44,
@@ -2732,20 +2746,6 @@ ce_avg: 0.20800399780273438, mse_avg: 0.011797062121331692
 [2026-01-26 22:55:50] (step=0002504) Train Loss mse: 0.0089, Train Loss ce: 0.0733, Train Steps/Sec: 0.68,
 [2026-01-26 22:55:52] (step=0002505) Train Loss mse: 0.0061, Train Loss ce: 0.0662, Train Steps/Sec: 0.68,
 [2026-01-26 22:55:53] (step=0002506) Train Loss mse: 0.0068, Train Loss ce: 0.0412, Train Steps/Sec: 0.68,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step3000
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.04329414293169975, mse_avg: 0.006353132426738739
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step3500
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.04275057464838028, mse_avg: 0.0056571937166154385
 [2026-01-26 22:55:55] (step=0002507) Train Loss mse: 0.0085, Train Loss ce: 0.0475, Train Steps/Sec: 0.49,
 [2026-01-26 22:55:57] (step=0002508) Train Loss mse: 0.0056, Train Loss ce: 0.0384, Train Steps/Sec: 0.69,
 [2026-01-26 22:55:58] (step=0002509) Train Loss mse: 0.0106, Train Loss ce: 0.0332, Train Steps/Sec: 0.68,
@@ -3773,6 +3773,20 @@ ce_avg: 0.04275057464838028, mse_avg: 0.0056571937166154385
 [2026-01-26 23:23:15] (step=0003531) Train Loss mse: 0.0057, Train Loss ce: 0.0184, Train Steps/Sec: 0.68,
 [2026-01-26 23:23:16] (step=0003532) Train Loss mse: 0.0089, Train Loss ce: 0.0406, Train Steps/Sec: 0.68,
 [2026-01-26 23:23:18] (step=0003533) Train Loss mse: 0.0058, Train Loss ce: 0.0525, Train Steps/Sec: 0.55,
 [2026-01-26 23:23:20] (step=0003534) Train Loss mse: 0.0057, Train Loss ce: 0.0243, Train Steps/Sec: 0.59,
 [2026-01-26 23:23:21] (step=0003535) Train Loss mse: 0.0072, Train Loss ce: 0.0283, Train Steps/Sec: 0.68,
 [2026-01-26 23:23:23] (step=0003536) Train Loss mse: 0.0100, Train Loss ce: 0.0241, Train Steps/Sec: 0.68,
@@ -3792,27 +3806,6 @@ ce_avg: 0.04275057464838028, mse_avg: 0.0056571937166154385
 [2026-01-26 23:23:44] (step=0003550) Train Loss mse: 0.0066, Train Loss ce: 0.0326, Train Steps/Sec: 0.68,
 [2026-01-26 23:23:46] (step=0003551) Train Loss mse: 0.0086, Train Loss ce: 0.0259, Train Steps/Sec: 0.48,
 [2026-01-26 23:23:48] (step=0003552) Train Loss mse: 0.0122, Train Loss ce: 0.0249, Train Steps/Sec: 0.59,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step4000
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.03896063566207886, mse_avg: 0.005920059513300657
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step4500
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.03593315929174423, mse_avg: 0.005641990341246128
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step5000
- Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
- ce_avg: 0.03604895621538162, mse_avg: 0.005358231253921986
 [2026-01-26 23:23:49] (step=0003553) Train Loss mse: 0.0082, Train Loss ce: 0.0280, Train Steps/Sec: 0.68,
 [2026-01-26 23:23:51] (step=0003554) Train Loss mse: 0.0045, Train Loss ce: 0.0642, Train Steps/Sec: 0.68,
 [2026-01-26 23:23:52] (step=0003555) Train Loss mse: 0.0075, Train Loss ce: 0.0468, Train Steps/Sec: 0.68,
@@ -5262,4 +5255,11 @@ ce_avg: 0.03604895621538162, mse_avg: 0.005358231253921986
 [2026-01-27 00:02:27] (step=0004999) Train Loss mse: 0.0058, Train Loss ce: 0.0432, Train Steps/Sec: 0.57,
 [2026-01-27 00:02:33] (step=0005000) Train Loss mse: 0.0040, Train Loss ce: 0.0410, Train Steps/Sec: 0.15,
 [2026-01-27 00:02:33] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/0005000.
- [2026-01-27 00:05:11] Done!
+ FullyShardedDataParallel(
+ (_fsdp_wrapped_module): Bagel(
+ (language_model): Qwen2ForCausalLM(
+ (model): Qwen2Model(
+ (embed_tokens): Embedding(152064, 3584)
+ (layers): ModuleList(
+ (0-27): 28 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+ (self_attn): PackedAttentionMoT(
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+ )
+ (mlp): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (mlp_moe_gen): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ )
+ )
+ )
+ )
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (rotary_emb): Qwen2RotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+ )
+ (time_embedder): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): TimestepEmbedder(
+ (mlp): Sequential(
+ (0): Linear(in_features=256, out_features=3584, bias=True)
+ (1): SiLU()
+ (2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
+ (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
+ (latent_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ (vit_model): SiglipVisionModel(
+ (vision_model): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
+ (embeddings): SiglipVisionEmbeddings(
+ (position_embedding): Embedding(4900, 1152)
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+ )
+ (encoder): SiglipEncoder(
+ (layers): ModuleList(
+ (0-25): 26 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
+ (self_attn): SiglipFlashAttention2(
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ )
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ (mlp): SiglipMLP(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+ )
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ )
+ )
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ (connector): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): MLPconnector(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vit_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ time_embedder._fsdp_wrapped_module._flat_param True
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
145
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
146
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
147
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
148
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
149
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
150
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
151
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
152
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
153
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
154
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
155
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
156
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
157
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
158
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
159
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
160
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
161
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
162
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
163
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
164
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
165
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
166
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
167
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
168
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
169
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
170
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
171
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
172
+ Preparing Dataset vlm_gym_match_equation_sos_celoss/vlm_gym_match_equation_sos_train
173
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step0
174
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
175
+ [eval debug] first 3 batch fingerprints:
176
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
177
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
178
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
179
+ ce_avg: 1.8181179761886597, mse_avg: 0.6639004349708557
180
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step500
181
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
182
+ [eval debug] first 3 batch fingerprints:
183
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
184
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
185
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
186
+ ce_avg: 0.06269639730453491, mse_avg: 0.0149351442232728
187
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step1000
188
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
189
+ [eval debug] first 3 batch fingerprints:
190
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
191
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
192
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
193
+ ce_avg: 0.06731808930635452, mse_avg: 0.011200404725968838
194
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 
  [2026-01-26 22:12:48] (step=0001007) Train Loss mse: 0.0116, Train Loss ce: 0.0712, Train Steps/Sec: 0.58,
  [2026-01-26 22:12:50] (step=0001008) Train Loss mse: 0.0130, Train Loss ce: 0.0336, Train Steps/Sec: 0.68,
  [2026-01-26 22:12:51] (step=0001009) Train Loss mse: 0.0141, Train Loss ce: 0.0598, Train Steps/Sec: 0.68,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step1500
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.1969330757856369, mse_avg: 0.010982552543282509
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step2000
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.24203190207481384, mse_avg: 0.011826912872493267
  [2026-01-26 22:12:53] (step=0001010) Train Loss mse: 0.0162, Train Loss ce: 0.0454, Train Steps/Sec: 0.68,
  [2026-01-26 22:12:54] (step=0001011) Train Loss mse: 0.0109, Train Loss ce: 0.0345, Train Steps/Sec: 0.68,
  [2026-01-26 22:12:56] (step=0001012) Train Loss mse: 0.0095, Train Loss ce: 0.0584, Train Steps/Sec: 0.68,
 
  [2026-01-26 22:14:01] (step=0001054) Train Loss mse: 0.0175, Train Loss ce: 0.0570, Train Steps/Sec: 0.68,
  [2026-01-26 22:14:03] (step=0001055) Train Loss mse: 0.0202, Train Loss ce: 0.0187, Train Steps/Sec: 0.68,
  [2026-01-26 22:14:04] (step=0001056) Train Loss mse: 0.0158, Train Loss ce: 0.0599, Train Steps/Sec: 0.68,
  [2026-01-26 22:14:06] (step=0001057) Train Loss mse: 0.0142, Train Loss ce: 0.0901, Train Steps/Sec: 0.68,
  [2026-01-26 22:14:07] (step=0001058) Train Loss mse: 0.0084, Train Loss ce: 0.0419, Train Steps/Sec: 0.67,
  [2026-01-26 22:14:09] (step=0001059) Train Loss mse: 0.0127, Train Loss ce: 0.0692, Train Steps/Sec: 0.56,
 
  [2026-01-26 22:52:32] (step=0002482) Train Loss mse: 0.0074, Train Loss ce: 0.0610, Train Steps/Sec: 0.58,
  [2026-01-26 22:52:33] (step=0002483) Train Loss mse: 0.0053, Train Loss ce: 0.0485, Train Steps/Sec: 0.68,
  [2026-01-26 22:52:35] (step=0002484) Train Loss mse: 0.0062, Train Loss ce: 0.0460, Train Steps/Sec: 0.57,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step2500
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.20800399780273438, mse_avg: 0.011797062121331692
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step3000
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.04329414293169975, mse_avg: 0.006353132426738739
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step3500
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.04275057464838028, mse_avg: 0.0056571937166154385
  [2026-01-26 22:52:37] (step=0002485) Train Loss mse: 0.0051, Train Loss ce: 0.0446, Train Steps/Sec: 0.69,
  [2026-01-26 22:52:38] (step=0002486) Train Loss mse: 0.0086, Train Loss ce: 0.0438, Train Steps/Sec: 0.57,
  [2026-01-26 22:52:41] (step=0002487) Train Loss mse: 0.0067, Train Loss ce: 0.0641, Train Steps/Sec: 0.44,
 
  [2026-01-26 22:55:50] (step=0002504) Train Loss mse: 0.0089, Train Loss ce: 0.0733, Train Steps/Sec: 0.68,
  [2026-01-26 22:55:52] (step=0002505) Train Loss mse: 0.0061, Train Loss ce: 0.0662, Train Steps/Sec: 0.68,
  [2026-01-26 22:55:53] (step=0002506) Train Loss mse: 0.0068, Train Loss ce: 0.0412, Train Steps/Sec: 0.68,
  [2026-01-26 22:55:55] (step=0002507) Train Loss mse: 0.0085, Train Loss ce: 0.0475, Train Steps/Sec: 0.49,
  [2026-01-26 22:55:57] (step=0002508) Train Loss mse: 0.0056, Train Loss ce: 0.0384, Train Steps/Sec: 0.69,
  [2026-01-26 22:55:58] (step=0002509) Train Loss mse: 0.0106, Train Loss ce: 0.0332, Train Steps/Sec: 0.68,
 
  [2026-01-26 23:23:15] (step=0003531) Train Loss mse: 0.0057, Train Loss ce: 0.0184, Train Steps/Sec: 0.68,
  [2026-01-26 23:23:16] (step=0003532) Train Loss mse: 0.0089, Train Loss ce: 0.0406, Train Steps/Sec: 0.68,
  [2026-01-26 23:23:18] (step=0003533) Train Loss mse: 0.0058, Train Loss ce: 0.0525, Train Steps/Sec: 0.55,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step4000
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.03896063566207886, mse_avg: 0.005920059513300657
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step4500
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.03593315929174423, mse_avg: 0.005641990341246128
  [2026-01-26 23:23:20] (step=0003534) Train Loss mse: 0.0057, Train Loss ce: 0.0243, Train Steps/Sec: 0.59,
  [2026-01-26 23:23:21] (step=0003535) Train Loss mse: 0.0072, Train Loss ce: 0.0283, Train Steps/Sec: 0.68,
  [2026-01-26 23:23:23] (step=0003536) Train Loss mse: 0.0100, Train Loss ce: 0.0241, Train Steps/Sec: 0.68,
 
  [2026-01-26 23:23:44] (step=0003550) Train Loss mse: 0.0066, Train Loss ce: 0.0326, Train Steps/Sec: 0.68,
  [2026-01-26 23:23:46] (step=0003551) Train Loss mse: 0.0086, Train Loss ce: 0.0259, Train Steps/Sec: 0.48,
  [2026-01-26 23:23:48] (step=0003552) Train Loss mse: 0.0122, Train Loss ce: 0.0249, Train Steps/Sec: 0.59,
  [2026-01-26 23:23:49] (step=0003553) Train Loss mse: 0.0082, Train Loss ce: 0.0280, Train Steps/Sec: 0.68,
  [2026-01-26 23:23:51] (step=0003554) Train Loss mse: 0.0045, Train Loss ce: 0.0642, Train Steps/Sec: 0.68,
  [2026-01-26 23:23:52] (step=0003555) Train Loss mse: 0.0075, Train Loss ce: 0.0468, Train Steps/Sec: 0.68,
 
  [2026-01-27 00:02:27] (step=0004999) Train Loss mse: 0.0058, Train Loss ce: 0.0432, Train Steps/Sec: 0.57,
  [2026-01-27 00:02:33] (step=0005000) Train Loss mse: 0.0040, Train Loss ce: 0.0410, Train Steps/Sec: 0.15,
  [2026-01-27 00:02:33] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/0005000.
+ [2026-01-27 00:05:11] Done!
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_ce_ins_step5000
+ Preparing Dataset vlm_gym_match_equation_sos_celoss_evalonce/vlm_gym_match_equation_sos_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_celoss_evalonce'}]
+ ce_avg: 0.03604895621538162, mse_avg: 0.005358231253921986