# Distribution Normalization Analysis
## Overview
Distribution normalization was implemented to produce consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between the normalized and non-normalized approaches and provides recommendations.
## The Problem
The original question was: *"Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."*
Different topics naturally have different semantic similarity ranges:
- **"Animals"**: Rich vocabulary, similarities often range 0.4-0.9
- **"Philosophy"**: Abstract concepts, similarities might range 0.1-0.6
- **"Technology"**: Mixed range, similarities around 0.2-0.8
This led to perceived "inconsistent difficulty" where "Easy Animals" felt easier than "Easy Philosophy" crosswords.
## Current Implementation
### Composite Score Formula
```
composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score
```
With default `difficulty_weight = 0.5`:
```
composite = 0.5 * similarity + 0.5 * freq_score
```
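As a worked sketch of this formula (NumPy and the variable names here are illustrative, not the generator's actual API):
```python
import numpy as np

def composite_scores(similarity: np.ndarray, freq_score: np.ndarray,
                     difficulty_weight: float = 0.5) -> np.ndarray:
    """Blend semantic similarity with a word-frequency score."""
    return (1 - difficulty_weight) * similarity + difficulty_weight * freq_score

# Example: two candidate words
sims = np.array([0.62, 0.35])
freqs = np.array([0.80, 0.40])
print(composite_scores(sims, freqs))  # [0.71  0.375]
```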
### Normalization Methods
1. **`similarity_range` (default)**: Normalizes similarities to [0,1] before composite calculation
2. **`composite_zscore`**: Z-score normalization (unbounded, typically -3 to +3)
3. **`percentile_recentering`**: Boosts scores based on proximity to target percentile (can exceed 1.0)
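A sketch of what each method plausibly does. Per the list above, method 1 acts on raw similarities before the composite step while methods 2 and 3 act on composite scores; the `percentile_recentering` boost formula in particular is a guess, not the actual implementation:
```python
import numpy as np

def normalize(scores: np.ndarray, method: str = "similarity_range",
              target_percentile: float = 50.0) -> np.ndarray:
    if method == "similarity_range":
        # Min-max rescale raw similarities to [0, 1]
        lo, hi = scores.min(), scores.max()
        return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)
    if method == "composite_zscore":
        # Standardize composite scores: unbounded, typically within [-3, +3]
        # (assumes non-constant scores, i.e. std > 0)
        return (scores - scores.mean()) / scores.std()
    if method == "percentile_recentering":
        # Hypothetical boost: scores near the target percentile gain up to 20%,
        # which is how results can exceed 1.0
        target = np.percentile(scores, target_percentile)
        spread = scores.ptp() or 1.0
        boost = 1.0 + 0.2 * (1.0 - np.abs(scores - target) / spread)
        return scores * boost
    raise ValueError(f"unknown normalization method: {method}")
```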
### Configuration
- `ENABLE_DISTRIBUTION_NORMALIZATION=true` (default)
- `NORMALIZATION_METHOD=similarity_range` (default)
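A sketch of how these flags might be read at startup (the env-var names come from above; the parsing itself is an assumption about the codebase):
```python
import os

ENABLE_DISTRIBUTION_NORMALIZATION = (
    os.environ.get("ENABLE_DISTRIBUTION_NORMALIZATION", "true").lower() == "true"
)
NORMALIZATION_METHOD = os.environ.get("NORMALIZATION_METHOD", "similarity_range")
```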
## Trade-offs Analysis
### Before Normalization (Original System)
#### Advantages ✅
1. **Natural semantic relationships preserved**
- Topics with broader vocabulary naturally had higher similarity ranges
- Reflected genuine linguistic density differences
- Authentic representation of semantic space
2. **Simpler and more predictable**
- Straightforward composite score calculation
- Always bounded to [0,1] naturally
- No artificial transformations
3. **Semantic honesty**
- Some topics ARE inherently harder to generate crosswords for
- System reflected this reality rather than masking it
- Valuable information for both system and users
4. **Computational efficiency**
- No additional normalization calculations
- Cleaner code path
#### Disadvantages ❌
1. **Inconsistent difficulty across topics**
- "Easy" for animals genuinely easier than "Easy" for philosophy
- Could confuse users expecting uniform difficulty
2. **User expectation mismatch**
- Players might expect same difficulty label = same challenge level
### After Normalization (Current System)
#### Advantages ✅
1. **Consistent difficulty intent**
- Attempts to make "Easy" equally easy across all topics
- Meets user expectations for uniform difficulty labels
2. **Debug visualization enhancements**
- Shows normalization effects in debug tab
- Helpful for analysis and understanding
#### Disadvantages ❌
1. **Artificial stretching of similarity ranges**
- Forces sparse topics to use full [0,1] range
- Genuinely dissimilar words appear artificially similar
- Loss of semantic authenticity
2. **Implementation complexity and bugs**
- Different methods produce different ranges
- Z-score normalization is unbounded
- Percentile recentering can exceed 1.0
- Softmax sensitivity to inconsistent ranges
3. **Loss of valuable information**
- Masks natural vocabulary density differences
- Hides genuine topic difficulty characteristics
- Makes debugging harder (what's "real" vs "normalized"?)
4. **Computational overhead**
- Additional calculations for normalization
- More complex code paths
- Potential for numerical issues
## Composite Score Ranges
### Without Normalization
- **Theoretical range**: [0, 1]
- **Practical range**: Depends on actual similarities in the 150-word thematic pool
- **Example**: If similarities range 0.3-0.7 and `freq_score` spans [0, 1], composite ≈ [0.15, 0.85]
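The arithmetic behind that example:
```python
# difficulty_weight = 0.5; similarities in [0.3, 0.7]; freq_score spans [0, 1]
low  = 0.5 * 0.3 + 0.5 * 0.0   # 0.15
high = 0.5 * 0.7 + 0.5 * 1.0   # 0.85
```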
### With Normalization
- **`similarity_range`**: ~[0, 1] (most consistent)
- **`composite_zscore`**: Unbounded (typically [-3, +3])
- **`percentile_recentering`**: Can exceed 1.0 due to boosting
## Problems with Current Implementation
1. **Range inconsistency**: Different normalization methods produce different ranges
2. **Unbounded z-scores**: Affect softmax probability calculations unpredictably (demonstrated after this list)
3. **Values exceeding [0,1]**: Break assumptions about composite score bounds
4. **Complexity without clear benefit**: Added complexity for questionable gains
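Problem 2 is easy to demonstrate. If word selection feeds these scores into a softmax (as the "softmax sensitivity" point in the trade-offs suggests), unbounded z-scores produce far more peaked probabilities than [0,1] scores, so selection behavior shifts with the normalization method. The values below are illustrative:
```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

bounded = np.array([0.2, 0.5, 0.8])     # e.g. similarity_range output
zscored = np.array([-1.22, 0.0, 1.22])  # the same values, z-scored
print(softmax(bounded))  # ~[0.24, 0.32, 0.44] -- gentle preference
print(softmax(zscored))  # ~[0.06, 0.21, 0.72] -- sharply peaked
```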
## Recommendation
### **Revert to Non-Normalized Approach**
The original system was **better** for these reasons:
1. **The "problem" wasn't really a problem**
- Different topics having different difficulty distributions is natural and informative
   - Philosophy IS harder to make crosswords for than animals; this is linguistic reality
2. **Normalization introduces distortions**
- Stretching narrow ranges doesn't make words more semantically similar
- Creates artificial relationships that don't exist
3. **Alternative solutions are better**:
- Show users the natural difficulty of each topic
   - Adjust word count based on topic vocabulary density (see the sketch after this list)
- Provide topic difficulty ratings to set expectations
- Use adaptive difficulty within topics rather than across them
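For the word-count idea, a purely illustrative sketch: the function, its thresholds, and the 0.6 reference point are assumptions (only the 150-word base pool appears in this document):
```python
def thematic_pool_size(mean_similarity: float,
                       base: int = 150, floor: int = 75) -> int:
    """Hypothetical: shrink the pool for semantically sparse topics instead of
    stretching their scores, so only genuinely related words survive."""
    # 0.6 is an assumed "dense topic" reference point, not a measured value
    return max(floor, min(base, int(base * mean_similarity / 0.6)))

# e.g. thematic_pool_size(0.65) -> 150 (dense topic: animals)
#      thematic_pool_size(0.35) -> 87  (sparse topic: philosophy)
```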
### If Normalization is Kept
If normalization must be retained:
1. **Make it opt-in, not the default**: `ENABLE_DISTRIBUTION_NORMALIZATION=false`
2. **Fix range consistency**: Ensure all methods produce [0,1] outputs
3. **Add proper bounds checking**: Clamp scores to [0,1] after normalization
4. **Document trade-offs clearly**: Let users make informed choices
## Proposed Implementation Fixes
If keeping normalization, fix these issues:
```python
import numpy as np

def enforce_unit_range(scores: np.ndarray, method: str) -> np.ndarray:
    """After normalization, ensure every method yields scores in [0, 1]."""
    if method == "composite_zscore":
        # Map unbounded z-scores into (0, 1) with a sigmoid
        scores = 1 / (1 + np.exp(-scores))
    elif method == "percentile_recentering":
        # Boosted scores can exceed 1.0; clamp to the valid range
        scores = np.clip(scores, 0, 1)
    # Final safety clamp for all methods
    return np.clip(scores, 0, 1)
```
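On the sigmoid choice: it maps z = 0 to 0.5 and compresses the tails smoothly, so the typical [-3, +3] z-range lands in roughly [0.05, 0.95] without the hard edges that clipping alone would introduce.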
## Conclusion
The **non-normalized approach respects semantic reality** and provides more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.
**Recommendation**: Disable normalization by default (`ENABLE_DISTRIBUTION_NORMALIZATION=false`) and let the natural semantic relationships guide difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.
The original system's variation across topics was a **feature representing real linguistic diversity**, not a problem requiring artificial correction.