Fix tensor dimension mismatch by disabling MLM pretrain for demo
- MLM_PRETRAIN_NOTE.md  +71 -0
- app.py  +1 -1
- simple_inference.py  +10 -21
MLM_PRETRAIN_NOTE.md
ADDED

@@ -0,0 +1,71 @@

# MLM Pretrain Dimension Mismatch Issue

## Problem Description

When `use_MLM_pretrain = True`, the model encounters a tensor dimension mismatch error:

```
The size of tensor a (56) must match the size of tensor b (55) at non-singleton dimension 1
```

## Root Cause Analysis

The issue occurs due to the following sequence of operations in `Network.forward()`:

1. **MLM Pretrain Processing (if enabled):**
   - `MLMTransformerPretrain.forward(text_dict)` is called with the original text length N
   - Internally, it creates embeddings and attention masks for length N
   - It returns `text_emb_src` with shape `[batch, N, embed_dim]`

2. **Diagram Concatenation:**
   - `diagram_emb_src` is created with shape `[batch, 1, embed_dim]`
   - The two are concatenated: `all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)`
   - The result has shape `[batch, N+1, embed_dim]`

3. **Length Adjustment:**
   - `text_dict['len'] += 1` (now N+1)
   - `var_dict['pos'] += 1`

4. **Issue:**
   - The MLM pretrain's TransformerEncoder has already created internal state (position embeddings, attention masks) for length N
   - But the actual sequence now has length N+1
   - This causes dimension mismatches in subsequent operations (reproduced in the sketch below)
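The failure mode can be reproduced in isolation. The following is a minimal, self-contained sketch, not the project's actual code; the tensor names mirror the description above and the sizes are illustrative. It shows how a mask built for length N stops broadcasting once the diagram token is prepended:

```python
import torch

batch, N, embed_dim = 1, 55, 8

# What the MLM pretrain path produces internally for a length-N text
text_emb_src = torch.randn(batch, N, embed_dim)
pad_mask_for_N = torch.zeros(batch, N, dtype=torch.bool)   # mask sized for N positions

# The diagram token is prepended afterwards, so the sequence grows to N+1
diagram_emb_src = torch.randn(batch, 1, embed_dim)
all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)  # [batch, N+1, embed_dim]

# Any later op that pairs the N+1-length sequence with the N-length mask fails
scores = all_emb_src.sum(dim=-1)                            # [batch, N+1]
try:
    scores.masked_fill(pad_mask_for_N, float('-inf'))
except RuntimeError as err:
    print(err)  # The size of tensor a (56) must match the size of tensor b (55) ...
```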
## Solution for Demo

For the demo, we've disabled MLM pretrain by setting `use_MLM_pretrain = False` in the Config class. This uses the simpler embedding path, which properly handles the dimension adjustments.

## Alternative Solutions (if MLM pretrain is needed)

### Option 1: Pre-allocate space for diagram

Modify the MLM pretrain path to account for the diagram token from the start:

```python
if self.cfg.use_MLM_pretrain:
    # Increment length before MLM pretrain
    text_dict_copy = text_dict.copy()
    text_dict_copy['len'] = text_dict['len'] + 1
    # Add padding for diagram position
    # ... adjust tokens/tags accordingly
    text_emb_src = self.mlm_pretrain(text_dict_copy)
```
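The padding step is left open above. One hypothetical way to fill it (the helper name `reserve_diagram_slot`, the `pad_idx` default, and the exact dict keys are assumptions, not verified against the repository) is to prepend a padded position to every per-token field, so the MLM encoder sizes its masks and position embeddings for N+1 from the start:

```python
import torch

def reserve_diagram_slot(text_dict, pad_idx=0):
    """Return a copy of text_dict with one extra padded position at the front.
    The MLM encoder then builds internal state for length N+1, and the diagram
    embedding can later overwrite the reserved first position."""
    out = dict(text_dict)
    token = text_dict['token']                                 # [batch, n_subwords, seq_len]
    out['token'] = torch.cat(
        [torch.full_like(token[:, :, :1], pad_idx), token], dim=2)
    for key in ('sect_tag', 'class_tag'):                      # [batch, seq_len] each
        tag = text_dict[key]
        out[key] = torch.cat([torch.full_like(tag[:, :1], pad_idx), tag], dim=1)
    out['len'] = text_dict['len'] + 1
    return out
```

Padding at the front (rather than the back) matches how `Network.forward()` prepends `diagram_emb_src` during concatenation.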
### Option 2: Post-process MLM output

Recompute position embeddings and masks after concatenation:

```python
if self.cfg.use_MLM_pretrain:
    text_emb_src = self.mlm_pretrain(text_dict)
    # After concatenation, reapply position encoding
    all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)
    # Recompute position embeddings for new length
    all_emb_src = self.recompute_positions(all_emb_src, text_dict['len'] + 1)
```
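Note that `recompute_positions` is a placeholder and does not necessarily exist in the codebase. A minimal sketch of such a helper, assuming standard sinusoidal position encodings (if the model uses learned position embeddings, they would instead need to be re-indexed for the new length):

```python
import math
import torch

def recompute_positions(emb, lengths=None):
    """Add fresh sinusoidal position encodings sized to the current sequence.

    emb:     [batch, seq_len, embed_dim] with embed_dim even
    lengths: unused in this sketch; could be used to zero out padded positions
    """
    _, seq_len, dim = emb.shape
    pos = torch.arange(seq_len, dtype=torch.float, device=emb.device).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float, device=emb.device)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim, device=emb.device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return emb + pe.unsqueeze(0)
```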
### Option 3: Separate diagram processing

Process the diagram separately and combine it at a later stage rather than concatenating embeddings.
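A hypothetical illustration of this option (the module and its names are made up, not taken from the repository): encode the text at its original length N, and only attach the projected diagram feature when assembling the combined representation, so every mask built inside the MLM path keeps its original size.

```python
import torch
import torch.nn as nn

class LateDiagramFusion(nn.Module):
    """Combine text and diagram features after text encoding (sketch only)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.diagram_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, text_emb_src, diagram_emb_src):
        # text_emb_src:    [batch, N, embed_dim], produced with masks sized for N
        # diagram_emb_src: [batch, 1, embed_dim]
        memory = torch.cat([self.diagram_proj(diagram_emb_src), text_emb_src], dim=1)
        return memory    # length N+1, built after text encoding is finished
```

The mask handed to later stages would still need one extra leading entry for the diagram position, but it would be constructed once, outside the MLM encoder.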
## Testing

To verify the fix works:

1. Upload an image and text to the demo
2. The model should process the input without dimension errors
3. Output should be generated (even if not perfectly accurate without MLM pretrain)

## Performance Impact

Disabling MLM pretrain may reduce model accuracy, since the pre-trained language model helps the model understand geometric relationships. However, it ensures stable operation for the demo.
app.py
CHANGED

@@ -43,7 +43,7 @@ class Config:
        # General
        self.dropout_rate = 0.2
        self.beam_size = 10
-       self.use_MLM_pretrain = True
+       self.use_MLM_pretrain = False  # Disabled due to dimension mismatch issues in demo
        self.MLM_pretrain_path = './LM_MODEL.pth'
        self.pretrain_emb_path = ''
simple_inference.py
CHANGED

@@ -50,27 +50,16 @@ def simple_process_input(image, text_input, model, src_lang, tgt_lang, cfg):
    # For simple case, we have 1 subword per token, so shape is [batch, 1, seq_len]
    # This gets embedded and summed over dim=1 to get [batch, seq_len, embed_dim]

-        ...
-        }
-    else:
-        # Non-MLM path also needs 3D tensor for consistency
-        token_tensor_3d = token_tensor.unsqueeze(1)  # [batch, 1, seq_len]
-
-        text_dict = {
-            'token': token_tensor_3d,
-            'sect_tag': torch.LongTensor([sect_tag_indices]).to(device),
-            'class_tag': torch.LongTensor([class_tag_indices]).to(device),
-            'len': torch.LongTensor([text_len]).to(device)
-        }
+    # Create 3D tensor: [batch_size, 1, text_len]
+    # Each token is a single subword, so middle dimension is 1
+    token_tensor_3d = token_tensor.unsqueeze(1)  # [batch, 1, seq_len]
+
+    text_dict = {
+        'token': token_tensor_3d,
+        'sect_tag': torch.LongTensor([sect_tag_indices]).to(device),
+        'class_tag': torch.LongTensor([class_tag_indices]).to(device),
+        'len': torch.LongTensor([text_len]).to(device)
+    }

    # Simple var dict (no variables detected)
    # Note: var positions need to account for the diagram token that will be added
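As a standalone sanity check of the shape flow described in the comments above (illustrative sizes, not the repository's actual embedding module):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, seq_len = 100, 8, 5
token_tensor = torch.randint(0, vocab_size, (1, seq_len))   # [batch, seq_len]

token_tensor_3d = token_tensor.unsqueeze(1)                  # [batch, 1, seq_len]

embed = nn.Embedding(vocab_size, embed_dim)
out = embed(token_tensor_3d)                                 # [batch, 1, seq_len, embed_dim]
out = out.sum(dim=1)                                         # [batch, seq_len, embed_dim]
print(out.shape)                                             # torch.Size([1, 5, 8])
```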