MLM Pretrain Dimension Mismatch Issue
Problem Description
When use_MLM_pretrain = True, the model encounters a tensor dimension mismatch error:
The size of tensor a (56) must match the size of tensor b (55) at non-singleton dimension 1
Root Cause Analysis
The issue occurs due to the following sequence of operations in Network.forward():
MLM Pretrain Processing (if enabled):
- MLMTransformerPretrain.forward(text_dict) is called with the original text length N
- Inside, it creates embeddings and attention masks for length N
- Returns text_emb_src with shape [batch, N, embed_dim]
Diagram Concatenation:
- diagram_emb_src is created with shape [batch, 1, embed_dim]
- These are concatenated: all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)
- Result has shape [batch, N+1, embed_dim]
Length Adjustment:
- text_dict['len'] += 1 (now N+1)
- var_dict['pos'] += 1
Issue:
- The MLM pretrain's TransformerEncoder has already created internal states (position embeddings, attention masks) for length N
- But the actual sequence now has length N+1
- This causes dimension mismatches in subsequent operations
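The failure can be reproduced in isolation. The sketch below is illustrative only: N = 55 is an assumption chosen to match the error message above, and the bare position-embedding tensor stands in for the internal states MLMTransformerPretrain builds for length N.

```python
import torch

batch, N, embed_dim = 2, 55, 64

text_emb_src = torch.randn(batch, N, embed_dim)       # MLM pretrain output for length N
diagram_emb_src = torch.randn(batch, 1, embed_dim)    # single diagram token
all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)  # [batch, N+1, embed_dim]

# Internal state (e.g. position embeddings) still sized for the original length N
pos_emb = torch.randn(1, N, embed_dim)

try:
    _ = all_emb_src + pos_emb
except RuntimeError as err:
    print(err)  # "The size of tensor a (56) must match the size of tensor b (55) at non-singleton dimension 1"
```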
Solution for Demo
For the demo, we've disabled MLM pretrain by setting use_MLM_pretrain = False in the Config class. This uses the simpler embedding path that properly handles the dimension adjustments.
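For reference, the change amounts to a single flag; the attribute name comes from the text above, and the rest of the Config class is omitted here:

```python
class Config:
    # Demo setting: skip MLM pretrain and take the simpler embedding path
    use_MLM_pretrain = False
```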
Alternative Solutions (if MLM pretrain is needed)
Option 1: Pre-allocate space for diagram
Modify the MLM pretrain path to account for the diagram token from the start:
```python
if self.cfg.use_MLM_pretrain:
    # Increment length before MLM pretrain
    text_dict_copy = text_dict.copy()
    text_dict_copy['len'] = text_dict['len'] + 1
    # Add padding for diagram position
    # ... adjust tokens/tags accordingly
    text_emb_src = self.mlm_pretrain(text_dict_copy)
```
Option 2: Post-process MLM output
Recompute position embeddings and masks after concatenation:
```python
if self.cfg.use_MLM_pretrain:
    text_emb_src = self.mlm_pretrain(text_dict)

# After concatenation, reapply position encoding
all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)
# Recompute position embeddings for new length
all_emb_src = self.recompute_positions(all_emb_src, text_dict['len'] + 1)
```
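The recompute_positions call above is a placeholder rather than an existing method. A hypothetical version using learned position embeddings might look like the following; all names, shapes, and the masking choice are assumptions:

```python
import torch
import torch.nn as nn

class RecomputePositions(nn.Module):
    """Hypothetical helper: re-applies position embeddings for the post-concatenation length."""

    def __init__(self, max_len: int, embed_dim: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, embed_dim)

    def forward(self, emb: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # emb: [batch, seq_len, embed_dim]; lengths: [batch], each entry is N + 1 for its sample
        seq_len = emb.size(1)
        positions = torch.arange(seq_len, device=emb.device)        # [seq_len]
        emb = emb + self.pos_emb(positions).unsqueeze(0)            # broadcast over batch
        pad_mask = positions.unsqueeze(0) < lengths.unsqueeze(1)    # [batch, seq_len]
        return emb * pad_mask.unsqueeze(-1)                         # zero out padded positions
```

Whether padded positions should be zeroed here or handled purely through attention masks depends on how the downstream encoder consumes the embeddings.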
Option 3: Separate diagram processing
Process the diagram separately and combine it with the text at a later stage, rather than concatenating the embeddings up front; a sketch follows.
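A rough fragment in the same style as the options above; text_encoder is a hypothetical downstream module, and none of these fusion details are taken from the repository:

```python
if self.cfg.use_MLM_pretrain:
    # MLM pretrain sees only the original text, so its internal states match length N exactly
    text_emb_src = self.mlm_pretrain(text_dict)              # [batch, N, embed_dim]

# Encode the text fully before the diagram is introduced, then fuse at a later stage;
# no length bookkeeping inside the MLM path needs to change
text_enc = self.text_encoder(text_emb_src)                   # hypothetical module
all_emb_src = torch.cat([diagram_emb_src, text_enc], dim=1)  # [batch, N+1, embed_dim]
```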
Testing
To verify the fix works:
- Upload an image and text to the demo
- The model should process the input without dimension errors
- Output should be generated (even if not perfectly accurate without MLM pretrain)
Performance Impact
Disabling MLM pretrain may reduce model accuracy since the pre-trained language model helps with understanding geometric relationships. However, it ensures stable operation for the demo.