# Distribution Normalization for Debug Visualization
## Executive Summary
Currently, probability distributions in the debug tab vary in position and shape based on the selected topic, making it difficult to assess the effectiveness of difficulty-based Gaussian targeting across different themes. This document proposes implementing distribution normalization to create consistent, topic-independent visualizations that clearly reveal algorithmic behavior.
## Current Problem
### Topic-Dependent Distribution Shifts
The current visualization shows probability distributions that vary significantly based on the input topic:
```
Topic: "animals" β Peak around position 60-80
Topic: "technology" β Peak around position 30-50
Topic: "history" β Peak around position 40-70
```
This variation occurs because different topics produce different ranges of similarity scores:
- High-similarity topics (e.g., "technology" → "TECH") compress the distribution leftward
- Lower-similarity topics spread the distribution more broadly
- The Gaussian frequency targeting gets masked by these topic-specific effects
### Visualization Challenges
1. **Inconsistent Baselines**: Each topic creates a different baseline probability distribution
2. **Difficult Comparison**: Cannot easily compare difficulty effectiveness across topics
3. **Masked Patterns**: The intended Gaussian targeting patterns get obscured by topic bias
4. **Misleading Statistics**: Mean (μ) and sigma (σ) positions vary dramatically between topics
## Benefits of Normalization
### 1. Consistent Difficulty Targeting Visualization
With normalization, each difficulty level would show:
- **Easy Mode**: Always peaks at the same visual position (90th percentile zone)
- **Medium Mode**: Always centers around 50th percentile zone
- **Hard Mode**: Always concentrates in 20th percentile zone
### 2. Topic-Independent Analysis
```
Normalized View:
Easy (animals):    █████████████████ (peak at 90%)
Easy (technology): █████████████████ (peak at 90%)
Easy (history):    █████████████████ (peak at 90%)
```
All topics would produce visually identical patterns for the same difficulty level.
### 3. Enhanced Diagnostic Capability
- Immediately spot when Gaussian targeting is failing
- Compare algorithm performance across different topic domains
- Validate that composite scoring weights are working correctly
- Identify topics that produce unusual similarity score distributions
## Implementation Strategies
### Option 1: Min-Max Normalization (Recommended)
**Formula:**
```python
normalized_probability = (probability - min_prob) / (max_prob - min_prob)
```
**Benefits:**
- Preserves relative probability relationships
- Maps all distributions to [0, 1] range
- Simple to implement and understand
- Maintains the shape of the original distribution
**Implementation:**
```python
def normalize_probability_distribution(probabilities):
    """Min-max normalize a list of {"probability": float, ...} dicts in place."""
    probs = [p["probability"] for p in probabilities]
    min_prob, max_prob = min(probs), max(probs)
    if max_prob == min_prob:
        # Flat distribution: give every entry the same normalized value
        # so downstream rendering never sees a missing key.
        for item in probabilities:
            item["normalized_probability"] = 1.0
        return probabilities
    for item in probabilities:
        item["normalized_probability"] = (
            item["probability"] - min_prob
        ) / (max_prob - min_prob)
    return probabilities
```
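A quick usage sketch (the words and probability values below are made up for illustration):
```python
sample = [
    {"word": "otter", "probability": 0.02},
    {"word": "lynx", "probability": 0.10},
    {"word": "zebra", "probability": 0.05},
]

normalized = normalize_probability_distribution(sample)
# The most likely word maps to 1.0, the least likely to 0.0.
assert normalized[1]["normalized_probability"] == 1.0
assert normalized[0]["normalized_probability"] == 0.0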
### Option 2: Z-Score Normalization
**Formula:**
```python
normalized = (probability - mean_prob) / std_dev_prob
```
**Benefits:**
- Centers all distributions around 0
- Shows standard deviations from mean
- Good for statistical analysis
**Drawbacks:**
- Negative values can be confusing in UI
- Requires additional explanation for users
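For completeness, a minimal sketch of this variant, assuming the same list-of-dicts shape used by the min-max helper above:
```python
import statistics

def zscore_probability_distribution(probabilities):
    """Attach a z-score to each entry; same in-place convention as the min-max helper."""
    probs = [p["probability"] for p in probabilities]
    mean_prob = statistics.mean(probs)
    std_prob = statistics.pstdev(probs)  # population std dev over the full candidate set
    for item in probabilities:
        # Define the z-score as 0.0 when the distribution is flat (std dev of 0).
        item["normalized_probability"] = (
            (item["probability"] - mean_prob) / std_prob if std_prob else 0.0
        )
    return probabilities
```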
### Option 3: Percentile Rank Normalization
**Formula:**
```python
normalized = percentile_rank(probability, all_probabilities) / 100
```
**Benefits:**
- Maps to [0, 1] range based on rank
- Emphasizes relative positioning
- Less sensitive to outliers
**Drawbacks:**
- Loses information about absolute probability differences
- Can flatten important distinctions
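A minimal sketch, using a mid-rank convention for ties so flat distributions land at 0.5 (the helper name is hypothetical; `scipy.stats.rankdata` with `method="average"` would do the same job if SciPy is available):
```python
def percentile_rank_distribution(probabilities):
    """Attach a rank-based value in [0, 1] to each entry; ties share a mid-rank."""
    probs = [p["probability"] for p in probabilities]
    n = len(probs)
    for item in probabilities:
        if n == 1:
            item["normalized_probability"] = 1.0
            continue
        below = sum(1 for p in probs if p < item["probability"])
        ties = sum(1 for p in probs if p == item["probability"]) - 1  # exclude self
        item["normalized_probability"] = (below + ties / 2) / (n - 1)
    return probabilities
```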
## Visual Impact Examples
### Before Normalization (Current State)
```
Animals Easy:  ████████████████████ (peak at position 60)
Tech Easy:     ████████████████████ (peak at position 30)
History Easy:  ████████████████████ (peak at position 45)
```
### After Normalization (Proposed)
```
Animals Easy:  ████████████████████ (normalized peak at 90%)
Tech Easy:     ████████████████████ (normalized peak at 90%)
History Easy:  ████████████████████ (normalized peak at 90%)
```
## Recommended Implementation Approach
### Phase 1: Data Collection Enhancement
Modify the backend to include normalization data:
```python
# In thematic_word_service.py _softmax_weighted_selection()
prob_distribution = {
    "probabilities": probability_data,
    "raw_stats": {
        "min_probability": min_prob,
        "max_probability": max_prob,
        "mean_probability": mean_prob,
        "std_probability": std_prob,
    },
    "normalized_probabilities": normalized_data,
}
```
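A sketch of how those statistics might be computed just before the structure is assembled; `probability_data` follows the snippet above, and the exact integration point inside `_softmax_weighted_selection()` is an assumption:
```python
import statistics

probs = [p["probability"] for p in probability_data]
min_prob, max_prob = min(probs), max(probs)
mean_prob = statistics.mean(probs)
std_prob = statistics.pstdev(probs)

# Reuse the Option 1 helper to populate the normalized view.
normalized_data = normalize_probability_distribution(probability_data)
```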
### Phase 2: Frontend Visualization Options
Add toggle buttons in the debug tab:
- **Raw Distribution**: Current behavior (for debugging)
- **Normalized Distribution**: New normalized view (for analysis)
- **Side-by-Side**: Show both for comparison
### Phase 3: Enhanced Statistical Markers
With normalization, the statistical markers (μ, σ) become more meaningful:
- μ should consistently align with difficulty targets (20%, 50%, 90%)
- σ should show consistent widths across topics for the same difficulty
- Deviations from expected positions indicate algorithmic issues
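One way to make that check concrete is to treat μ as the probability-weighted mean position along the percentile-sorted axis and σ as the weighted spread around it; the helper below and its 0-to-1 position convention are illustrative assumptions, not the existing marker code:
```python
def distribution_mu_sigma(probabilities):
    """Probability-weighted mean and spread of position on a 0..1 percentile axis.

    Assumes the list is already sorted by percentile, as in the debug view.
    """
    n = len(probabilities)
    if n < 2:
        return 0.5, 0.0  # degenerate case: a single candidate sits mid-axis
    weights = [p["probability"] for p in probabilities]
    total = sum(weights)
    positions = [i / (n - 1) for i in range(n)]  # 0.0 = lowest percentile, 1.0 = highest
    mu = sum(pos * w for pos, w in zip(positions, weights)) / total
    var = sum(w * (pos - mu) ** 2 for pos, w in zip(positions, weights)) / total
    return mu, var ** 0.5

# Toy check: mass concentrated near the top of the axis pushes mu toward 1.0.
toy = [{"probability": w} for w in (0.05, 0.10, 0.25, 0.60)]
mu, sigma = distribution_mu_sigma(toy)
assert mu > 0.6  # easy mode should land near 0.9 on real data
```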
## Expected Outcomes
### Successful Implementation Indicators
1. **Visual Consistency**: All easy mode distributions peak at the same normalized position
2. **Clear Difficulty Separation**: Easy, Medium, Hard show distinct, predictable patterns
3. **Topic Independence**: Changing topics doesn't change the distribution shape/position
4. **Diagnostic Power**: Algorithm issues become immediately obvious
### Validation Tests
```python
# Test cases to validate normalization
test_cases = [
    ("animals", "easy"),
    ("technology", "easy"),
    ("history", "easy"),
    # Should all produce identical normalized distributions
]

for topic, difficulty in test_cases:
    distribution = generate_normalized_distribution(topic, difficulty)
    assert peak_position(distribution) == EXPECTED_EASY_PEAK
    assert distribution_width(distribution) == EXPECTED_EASY_WIDTH
```
## Implementation Timeline
### Week 1: Backend Changes
- Modify `_softmax_weighted_selection()` to compute normalization statistics
- Add normalized probability calculation
- Update debug data structure
- Add unit tests
### Week 2: Frontend Integration
- Add normalization toggle to debug tab
- Implement normalized chart rendering
- Update statistical marker calculations
- Add explanatory tooltips
### Week 3: Testing & Validation
- Test across multiple topics and difficulties
- Validate that normalization reveals expected patterns
- Document findings and create examples
- Performance optimization if needed
## Future Enhancements
### Dynamic Normalization Scopes
- **Per-topic normalization**: Normalize within each topic separately
- **Cross-topic normalization**: Normalize across all topics globally
- **Per-difficulty normalization**: Normalize within difficulty levels
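A possible shape for all three scopes, sketched as one function with a grouping key; the `scope` parameter and the record fields are hypothetical:
```python
from collections import defaultdict

def normalize_by_scope(records, scope=None):
    """Min-max normalize within groups keyed by a record field, e.g. "topic" or "difficulty".

    Records are dicts carrying at least "probability" plus the scope field;
    pass scope=None to normalize globally across everything.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(scope) if scope else None].append(rec)
    for members in groups.values():
        normalize_probability_distribution(members)  # Option 1 helper, applied per group
    return records
```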
### Advanced Statistical Views
- **Overlay comparisons**: Show multiple topics/difficulties on same chart
- **Animation**: Transition between raw and normalized views
- **Heatmap visualization**: Show 2D difficulty×topic probability landscapes
## Risk Mitigation
### Potential Issues
1. **Information Loss**: Normalization might hide important absolute differences
2. **User Confusion**: Additional complexity in the interface
3. **Performance**: Extra computation for large datasets
### Mitigation Strategies
1. **Always provide raw view option**: Never remove the original visualization
2. **Clear labeling**: Explicitly indicate when normalization is active
3. **Efficient algorithms**: Use vectorized operations for normalization
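On the last point, the min-max step reduces to a single vectorized expression for large candidate pools (a NumPy sketch; NumPy as a dependency is an assumption):
```python
import numpy as np

def normalize_vectorized(probs: np.ndarray) -> np.ndarray:
    """Min-max normalize in one pass; flat inputs map to all ones, matching the helper above."""
    lo, hi = probs.min(), probs.max()
    if hi == lo:
        return np.ones_like(probs)
    return (probs - lo) / (hi - lo)
```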
## Conclusion
Distribution normalization will transform the debug visualization from a topic-specific diagnostic tool into a universal algorithm validation system. By removing topic-dependent bias, we can clearly see whether the Gaussian frequency targeting is working as designed, regardless of the input theme.
The recommended min-max normalization approach preserves the essential characteristics of the probability distributions while ensuring consistent, comparable visualizations across all topics and difficulties.
This enhancement will significantly improve the ability to:
- Validate algorithm correctness
- Debug difficulty-targeting issues
- Compare performance across different domains
- Demonstrate the effectiveness of the composite scoring system
---
*This proposal builds on the successful percentile-sorted visualization implementation to create an even more powerful debugging and analysis tool.*