# MLM Probability Fix - Complete Documentation
## Issue Identified
The user correctly observed that **changing the MLM probability did not affect the results at all** in the encoder model visualization. This was a significant bug in how the MLM probability parameter was being used.
## Root Cause Analysis
### What Was Wrong
The MLM probability setting had two separate effects that were not properly connected:
1. **Average Perplexity Calculation** ✅ (working correctly)
   - Used random masking with the specified MLM probability
   - Affected the summary statistic shown to the user
2. **Per-Token Visualization** ❌ (bug was here)
   - Always masked each token individually
   - Completely ignored the MLM probability setting
   - This meant changing MLM probability had no visual effect
### The Disconnect
```python
# OLD CODE - MLM probability was ignored for visualization
for i in range(len(tokens)):
    if not special_token:
        # ALWAYS calculated detailed perplexity for every token
        masked_input[0, i] = tokenizer.mask_token_id
        # ... calculate perplexity
```
## The Fix
### 1. Made MLM Probability Affect Visualization
Now the MLM probability controls which tokens get detailed analysis:
```python
# NEW CODE - MLM probability affects visualization
for i in range(len(tokens)):
    if not special_token:
        if torch.rand(1).item() < mlm_probability:  # Now respects MLM probability
            # Calculate detailed perplexity for this token
            masked_input[0, i] = tokenizer.mask_token_id
            # ... calculate detailed perplexity
        else:
            # Use baseline perplexity for non-analyzed tokens
            token_perplexities.append(2.0)  # Neutral baseline
```
### 2. Visual Distinction
- **Analyzed tokens**: Colored by actual perplexity (green/yellow/red)
- **Non-analyzed tokens**: Gray color with baseline perplexity
- **Tooltip**: Shows whether token was analyzed or not
### 3. Clear User Feedback
- Summary now shows: `MLM Probability: 0.15 (3/8 tokens analyzed in detail)`
- Legend updated: `🟢 Low → 🟡 Medium → 🔴 High → ⚫ Not analyzed`
- Improved help text: "Probability of detailed analysis per token"
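The per-token analyzed/not-analyzed decisions can be tallied directly into that summary string. A minimal sketch, where the helper name `summarize_analysis` is hypothetical rather than the app's actual function:

```python
def summarize_analysis(analyzed_flags, mlm_probability):
    """Build the user-facing summary, e.g. 'MLM Probability: 0.15 (3/8 tokens analyzed in detail)'."""
    analyzed = sum(analyzed_flags)  # True counts as 1
    total = len(analyzed_flags)
    return f"MLM Probability: {mlm_probability} ({analyzed}/{total} tokens analyzed in detail)"

# One flag per non-special token, as produced by the per-token coin flips
flags = [True, False, True, False, True, False, False, False]
print(summarize_analysis(flags, 0.15))
```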
## How It Works Now
### Low MLM Probability (0.15)
```
Input: "The capital of France is Paris"
Result: Only ~15% of tokens get detailed analysis
Visualization: Mostly gray tokens with a few colored ones
Effect: Fast analysis, matches BERT training conditions
```
### High MLM Probability (0.5)
```
Input: "The capital of France is Paris"
Result: ~50% of tokens get detailed analysis
Visualization: More colored tokens, fewer gray ones
Effect: More comprehensive but slower analysis
```
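The ~15% and ~50% figures above are expectations over independent per-token coin flips. A quick simulation, using the stdlib `random` module to stand in for the app's `torch.rand` call (the function name is illustrative), confirms the long-run fractions:

```python
import random

def simulate_analyzed_fraction(num_tokens, mlm_probability, trials=10_000, seed=0):
    """Estimate what fraction of tokens receive detailed analysis.

    Mirrors the app's per-token test (torch.rand(1).item() < mlm_probability)
    with a seeded stdlib generator so the estimate is reproducible.
    """
    rng = random.Random(seed)
    flips = trials * num_tokens
    analyzed = sum(1 for _ in range(flips) if rng.random() < mlm_probability)
    return analyzed / flips

print(simulate_analyzed_fraction(8, 0.15))  # close to 0.15
print(simulate_analyzed_fraction(8, 0.5))   # close to 0.5
```

For a single short sentence the realized count varies run to run, which is exactly why the summary reports the actual "(3/8 tokens analyzed)" count.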
## User Experience Improvements
### Before the Fix
- User changes MLM probability from 0.15 → 0.5
- No visual change in token colors
- Only the summary statistic changed (confusing!)
### After the Fix
- User changes MLM probability from 0.15 → 0.5
- More tokens become colored (analyzed)
- Fewer tokens remain gray (non-analyzed)
- Summary shows token count: "(3/8 tokens analyzed)"
- Clear visual feedback of the parameter's effect
## Testing the Fix
### 1. Quick Test
Try the same text with different MLM probabilities:
- Text: "Machine learning algorithms require computational resources"
- MLM 0.2: Few colored tokens
- MLM 0.8: Most tokens colored
### 2. Demo Script
```bash
python mlm_demo.py
```
Shows exactly how MLM probability affects analysis.
### 3. Visual Examples
The app now includes example pairs:
- Same text with MLM 0.2 vs 0.8
- Shows clear visual difference
## Technical Details
### Randomness Handling
- Uses `torch.rand()` for consistency with PyTorch
- Each token gets independent random chance
- Reproducible with manual seeds for testing
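The seeding behavior can be sketched as follows; the helper is illustrative, not the app's actual code:

```python
import torch

def mask_decisions(num_tokens, mlm_probability, seed=None):
    """Return one boolean per token: True = token gets detailed analysis.

    Passing a seed pins down the per-token coin flips, so the masking
    pattern (and hence the visualization) is identical on every run.
    """
    if seed is not None:
        torch.manual_seed(seed)
    return [torch.rand(1).item() < mlm_probability for _ in range(num_tokens)]

# Same seed -> identical masking pattern
print(mask_decisions(8, 0.5, seed=42) == mask_decisions(8, 0.5, seed=42))
```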
### Baseline Perplexity
- Non-analyzed tokens get perplexity = 2.0
- This represents "neutral" confidence
- Avoids misleading very low/high values
### Color Mapping
- Analyzed tokens: Full color spectrum based on actual perplexity
- Non-analyzed tokens: Gray (`rgb(200, 200, 200)`)
- Tooltips distinguish: "Perplexity: 5.2" vs "Not analyzed"
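A sketch of that mapping is below. Only the gray value `rgb(200, 200, 200)` comes from the description above; the low/high perplexity cutoffs (5 and 20) and the function name are assumptions for illustration:

```python
def token_color(perplexity, analyzed):
    """Pick a display color for one token."""
    if not analyzed:
        return "rgb(200, 200, 200)"  # gray: token was skipped, not scored
    if perplexity < 5:
        return "green"   # low perplexity: model is confident
    if perplexity <= 20:
        return "yellow"  # medium perplexity
    return "red"         # high perplexity: model is surprised

print(token_color(5.2, analyzed=True))   # colored by value
print(token_color(5.2, analyzed=False))  # always gray
```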
## Performance Implications
### Lower MLM Probability (0.15)
- **Pros**: Faster, matches BERT training, realistic
- **Cons**: Sparse analysis, some tokens not evaluated
### Higher MLM Probability (0.8)
- **Pros**: Comprehensive analysis, more visual information
- **Cons**: Slower computation, unrealistic for MLM
### Recommendation
- **Default 0.15**: Standard BERT-like analysis
- **Increase to 0.3-0.5**: For more detailed exploration
- **Avoid >0.8**: Diminishing returns, very slow
## Impact on Model Types
### Decoder Models (GPT, etc.)
- **No change**: MLM probability only affects encoder models
- Always analyze all tokens for next-token prediction
### Encoder Models (BERT, etc.)
- **Major improvement**: MLM probability now has clear visual effect
- Users can explore different analysis depths
- Better understanding of model confidence patterns
## User Guidance
### When to Use Different MLM Probabilities
**0.15 (Standard)**
- Quick analysis
- Matches BERT training
- Good for initial exploration
**0.3-0.4 (Detailed)**
- More comprehensive view
- Better for understanding difficult texts
- Reasonable computation time
**0.5+ (Comprehensive)**
- Maximum detail
- Research/analysis purposes
- Slower but thorough
## Future Enhancements
### Possible Improvements
1. **Adaptive MLM**: Adjust probability based on text difficulty
2. **Token importance**: Prioritize content words over function words
3. **Interactive selection**: Let users click tokens to analyze
4. **Batch analysis**: Process multiple MLM probabilities simultaneously
### Configuration Options
The fix is fully configurable via `config.py`:
- Default MLM probability
- Min/max ranges
- Baseline perplexity value
- Color scheme for non-analyzed tokens
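The actual `config.py` isn't reproduced here; a plausible shape for those four options, with every name and default value being a hypothetical stand-in, might look like:

```python
from dataclasses import dataclass

@dataclass
class MLMConfig:
    """Illustrative only -- names and values are assumptions, not the app's config.py."""
    default_mlm_probability: float = 0.15
    min_mlm_probability: float = 0.05
    max_mlm_probability: float = 0.90
    baseline_perplexity: float = 2.0                  # shown for non-analyzed tokens
    non_analyzed_color: str = "rgb(200, 200, 200)"    # gray

config = MLMConfig()
print(config.default_mlm_probability)
```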
## Conclusion
This fix transforms the MLM probability from a "hidden parameter" that only affected summary statistics into a **visible, interactive control** that directly impacts the visualization. Users now get immediate visual feedback when adjusting MLM probability, making the parameter's purpose clear and the analysis more engaging.
The fix maintains backward compatibility while significantly improving the user experience for encoder model analysis.