# MLM Pretrain Dimension Mismatch Issue

## Problem Description
When `use_MLM_pretrain = True`, the model encounters a tensor dimension mismatch error:
```
The size of tensor a (56) must match the size of tensor b (55) at non-singleton dimension 1
```

## Root Cause Analysis

The issue occurs due to the following sequence of operations in `Network.forward()` (a toy reproduction of the resulting error follows the list):

1. **MLM Pretrain Processing (if enabled):**
   - `MLMTransformerPretrain.forward(text_dict)` is called with the original text length N
   - Inside, it creates embeddings and attention masks for length N
   - Returns `text_emb_src` with shape `[batch, N, embed_dim]`

2. **Diagram Concatenation:**
   - `diagram_emb_src` is created with shape `[batch, 1, embed_dim]`
   - These are concatenated: `all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)`
   - Result has shape `[batch, N+1, embed_dim]`

3. **Length Adjustment:**
   - `text_dict['len'] += 1` (now N+1)
   - `var_dict['pos'] += 1`

4. **Issue:**
   - The MLM pretrain's TransformerEncoder has already created internal states (position embeddings, attention masks) for length N
   - But the actual sequence now has length N+1
   - This causes dimension mismatches in subsequent operations
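
To make the mismatch concrete, here is a self-contained toy reproduction; the tensor names mirror the ones above, the sizes 55/56 match the error message, and the additive mask stands in for whatever length-N state the MLM encoder built internally:

```python
import torch

batch, N, d = 2, 55, 16

# What the MLM pretrain path hands back, plus the padding mask its encoder built for length N.
text_emb_src = torch.randn(batch, N, d)
pad_mask = torch.zeros(batch, N, dtype=torch.bool)

# Prepending the diagram token grows the sequence to N + 1.
diagram_emb_src = torch.randn(batch, 1, d)
all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)   # [batch, N + 1, d]

# Any element-wise op that pairs the stale length-N mask with the length-(N + 1)
# sequence fails with the broadcast error quoted above.
scores = all_emb_src.sum(dim=-1)              # [batch, N + 1], stand-in for attention scores
try:
    scores = scores + pad_mask.float() * -1e9
except RuntimeError as err:
    print(err)  # The size of tensor a (56) must match the size of tensor b (55) at non-singleton dimension 1
```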

## Solution for Demo

For the demo, we've disabled MLM pretrain by setting `use_MLM_pretrain = False` in the Config class. This falls back to the simpler embedding path, which handles the length adjustment for the diagram token correctly.
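
Concretely, the flag is just flipped on the Config class (shown here in minimal form; every other field stays as it was):

```python
class Config:
    # ... other demo settings unchanged ...
    use_MLM_pretrain = False   # take the plain embedding path and avoid the N vs. N+1 mismatch
```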

## Alternative Solutions (if MLM pretrain is needed)

### Option 1: Pre-allocate space for diagram
Modify the MLM pretrain path to account for the diagram token from the start:
```python
if self.cfg.use_MLM_pretrain:
    # Increment length before MLM pretrain
    text_dict_copy = text_dict.copy()
    text_dict_copy['len'] = text_dict['len'] + 1
    # Add padding for diagram position
    # ... adjust tokens/tags accordingly
    text_emb_src = self.mlm_pretrain(text_dict_copy)
```
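
The elided token/tag adjustment could look roughly like the sketch below; the keys `'token'` and `'tag'` and the use of a zero placeholder id are assumptions for illustration, not something the codebase confirms:

```python
# Hypothetical continuation of Option 1: reserve position 0 for the diagram token.
pad_col = torch.zeros_like(text_dict['token'][:, :1])       # assumed [batch, 1] placeholder column
text_dict_copy['token'] = torch.cat([pad_col, text_dict['token']], dim=1)
text_dict_copy['tag'] = torch.cat([pad_col, text_dict['tag']], dim=1)

# The MLM encoder now builds masks and position embeddings for length N + 1, and the
# diagram embedding can replace the placeholder at position 0 after the forward pass.
text_emb_src = self.mlm_pretrain(text_dict_copy)
all_emb_src = torch.cat([diagram_emb_src, text_emb_src[:, 1:]], dim=1)
```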

### Option 2: Post-process MLM output
Recompute position embeddings and masks after concatenation:
```python
if self.cfg.use_MLM_pretrain:
    text_emb_src = self.mlm_pretrain(text_dict)
    # After concatenation, reapply position encoding
    all_emb_src = torch.cat([diagram_emb_src, text_emb_src], dim=1)
    # Recompute position embeddings for new length
    all_emb_src = self.recompute_positions(all_emb_src, text_dict['len'] + 1)
```
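
`recompute_positions` does not exist yet; below is a minimal sketch, assuming standard sinusoidal encodings are simply re-added for the padded length, an even embedding dimension, and that the second argument is the padded (maximum) length of the batch. A model with learned position embeddings would instead slice its embedding table up to the new length:

```python
import math
import torch

def recompute_positions(emb: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Add fresh sinusoidal position encodings for the new (padded) sequence length.

    emb: [batch, seq_len, d_model] embeddings after the diagram token is prepended.
    seq_len: padded length of the batch, i.e. all_emb_src.size(1) == N + 1.
    """
    d_model = emb.size(-1)
    pos = torch.arange(seq_len, device=emb.device, dtype=torch.float32).unsqueeze(1)    # [seq_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2, device=emb.device, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                                   # [d_model // 2]
    pe = torch.zeros(seq_len, d_model, device=emb.device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return emb + pe.unsqueeze(0)   # broadcast over the batch dimension
```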

### Option 3: Separate diagram processing
Process diagram separately and combine at a later stage rather than concatenating embeddings.
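
One possible shape for this keeps the MLM encoder at length N throughout and fuses the diagram only in the downstream stage; `embed_text`, `embed_diagram`, `diagram_input`, and `build_pad_mask` are illustrative placeholders rather than existing functions:

```python
if self.cfg.use_MLM_pretrain:
    # The MLM-pretrained encoder only ever sees the original length-N text,
    # so its internal masks and position embeddings stay consistent.
    text_emb_src = self.mlm_pretrain(text_dict)            # [batch, N, embed_dim]
else:
    text_emb_src = self.embed_text(text_dict)

diagram_emb_src = self.embed_diagram(diagram_input)        # [batch, 1, embed_dim], as in the current path

# Fuse only here, rebuilding the padding mask for the fused length instead of
# reusing anything the MLM encoder created for length N.
memory = torch.cat([diagram_emb_src, text_emb_src], dim=1)        # [batch, N + 1, embed_dim]
memory_pad_mask = self.build_pad_mask(text_dict['len'] + 1,       # hypothetical helper building a
                                      memory.size(1))             # fresh mask for length N + 1
# memory and memory_pad_mask then feed the downstream decoder stage.
```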

## Testing
To verify the fix works:
1. Upload an image and text to the demo
2. The model should process without dimension errors
3. Output should be generated (even if not perfectly accurate without MLM pretrain)

## Performance Impact
Disabling MLM pretrain may reduce model accuracy since the pre-trained language model helps with understanding geometric relationships. However, it ensures stable operation for the demo.