# Distribution Normalization Analysis

## Overview

Distribution normalization is a feature implemented to ensure consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between normalized and non-normalized approaches and provides recommendations.

## The Problem

The original question was: *"Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."*

Different topics naturally have different semantic similarity ranges:
- **"Animals"**: Rich vocabulary, similarities often range 0.4-0.9
- **"Philosophy"**: Abstract concepts, similarities might range 0.1-0.6  
- **"Technology"**: Mixed range, similarities around 0.2-0.8

This led to perceived "inconsistent difficulty" where "Easy Animals" felt easier than "Easy Philosophy" crosswords.

## Current Implementation

### Composite Score Formula
```
composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score
```

With default `difficulty_weight = 0.5`:
```
composite = 0.5 * similarity + 0.5 * freq_score
```
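A minimal sketch of this calculation (the function and variable names here are illustrative, not taken from the codebase):

```python
import numpy as np

def composite_scores(similarities, freq_scores, difficulty_weight=0.5):
    """Blend semantic similarity with a word-frequency score.

    Both inputs are expected in [0, 1], so the result stays in [0, 1].
    """
    similarities = np.asarray(similarities, dtype=float)
    freq_scores = np.asarray(freq_scores, dtype=float)
    return (1 - difficulty_weight) * similarities + difficulty_weight * freq_scores

# With the default weight, the two signals contribute equally:
# 0.5 * 0.3 + 0.5 * 0.2 = 0.25, and 0.5 * 0.7 + 0.5 * 0.8 = 0.75.
scores = composite_scores([0.3, 0.7], [0.2, 0.8])
```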

### Normalization Methods

1. **`similarity_range` (default)**: Normalizes similarities to [0,1] before composite calculation
2. **`composite_zscore`**: Z-score normalization (unbounded, typically -3 to +3)
3. **`percentile_recentering`**: Boosts scores based on proximity to target percentile (can exceed 1.0)
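As a rough sketch of what these three methods might look like (illustrative only; the `percentile_recentering` boost formula in particular is a guessed shape, not the actual implementation):

```python
import numpy as np

def normalize(similarities, method="similarity_range"):
    """Illustrative sketches of the three normalization methods."""
    s = np.asarray(similarities, dtype=float)
    if method == "similarity_range":
        # Min-max scale to [0, 1]; a degenerate (constant) pool maps to 0.5.
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.full_like(s, 0.5)
    if method == "composite_zscore":
        # Standard z-score: unbounded, values typically land in roughly [-3, +3].
        return (s - s.mean()) / s.std()
    if method == "percentile_recentering":
        # Boost scores near a target percentile; this boost shape is a guess.
        # Note the result can exceed 1.0, matching the caveat above.
        target = np.percentile(s, 50)
        return s + 0.2 * (1.0 - np.abs(s - target))
    raise ValueError(f"unknown method: {method}")
```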

### Configuration
- `ENABLE_DISTRIBUTION_NORMALIZATION=true` (default)
- `NORMALIZATION_METHOD=similarity_range` (default)

## Trade-offs Analysis

### Before Normalization (Original System)

#### Advantages ✅
1. **Natural semantic relationships preserved**
   - Topics with broader vocabulary naturally had higher similarity ranges
   - Reflected genuine linguistic density differences
   - Authentic representation of semantic space

2. **Simpler and more predictable**
   - Straightforward composite score calculation
   - Always bounded to [0,1] naturally
   - No artificial transformations

3. **Semantic honesty**
   - Some topics ARE inherently harder to generate crosswords for
   - System reflected this reality rather than masking it
   - Valuable information for both system and users

4. **Computational efficiency**
   - No additional normalization calculations
   - Cleaner code path

#### Disadvantages ❌
1. **Inconsistent difficulty across topics**
   - "Easy" for animals was genuinely easier than "Easy" for philosophy
   - Could confuse users expecting uniform difficulty

2. **User expectation mismatch**
   - Players might expect same difficulty label = same challenge level

### After Normalization (Current System)

#### Advantages ✅
1. **Consistent difficulty intent**
   - Attempts to make "Easy" equally easy across all topics
   - Meets user expectations for uniform difficulty labels

2. **Debug visualization enhancements**
   - Shows normalization effects in debug tab
   - Helpful for analysis and understanding

#### Disadvantages ❌
1. **Artificial stretching of similarity ranges**
   - Forces sparse topics to use full [0,1] range
   - Genuinely dissimilar words appear artificially similar
   - Loss of semantic authenticity

2. **Implementation complexity and bugs**
   - Different methods produce different ranges
   - Z-score normalization is unbounded
   - Percentile recentering can exceed 1.0
   - Softmax sensitivity to inconsistent ranges

3. **Loss of valuable information**
   - Masks natural vocabulary density differences
   - Hides genuine topic difficulty characteristics
   - Makes debugging harder (what's "real" vs "normalized"?)

4. **Computational overhead**
   - Additional calculations for normalization
   - More complex code paths
   - Potential for numerical issues

## Composite Score Ranges

### Without Normalization
- **Theoretical range**: [0, 1]
- **Practical range**: Depends on actual similarities in the 150-word thematic pool
- **Example**: If similarities range 0.3-0.7, composite ≈ [0.15, 0.85]
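That range follows directly from the formula: the minimum pairs the lowest similarity with a frequency score of 0, and the maximum pairs the highest similarity with a frequency score of 1.

```python
# With similarity in [0.3, 0.7], freq_score in [0, 1], and the default
# difficulty_weight of 0.5:
low = 0.5 * 0.3 + 0.5 * 0.0   # lowest similarity, lowest frequency
high = 0.5 * 0.7 + 0.5 * 1.0  # highest similarity, highest frequency
```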

### With Normalization
- **`similarity_range`**: ~[0, 1] (most consistent)
- **`composite_zscore`**: Unbounded (typically [-3, +3])
- **`percentile_recentering`**: Can exceed 1.0 due to boosting

## Problems with Current Implementation

1. **Range inconsistency**: Different normalization methods produce different ranges
2. **Unbounded z-scores**: Affect softmax probability calculations unpredictably  
3. **Values exceeding [0,1]**: Break assumptions about composite score bounds
4. **Complexity without clear benefit**: Added complexity for questionable gains
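The z-score problem can be demonstrated directly: softmax is scale-sensitive, so stretching scores from [0, 1] onto a z-score scale sharpens the selection distribution even when the ranking is unchanged. A small self-contained sketch (the score values are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

bounded = softmax(np.array([0.2, 0.5, 0.8]))   # scores in [0, 1]
zscored = softmax(np.array([-2.0, 0.0, 2.0]))  # same ranking, wider scale
# The z-scored distribution concentrates far more probability mass on the
# top-ranked word, so sampling behavior changes even though no rank changed.
```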

## Recommendation

### **Revert to Non-Normalized Approach** 

The original system was **better** for these reasons:

1. **The "problem" wasn't really a problem**
   - Different topics having different difficulty distributions is natural and informative
   - Philosophy IS harder to make crosswords for than animals - this is linguistic reality

2. **Normalization introduces distortions**
   - Stretching narrow ranges doesn't make words more semantically similar
   - Creates artificial relationships that don't exist

3. **Alternative solutions are better**:
   - Show users the natural difficulty of each topic
   - Adjust word count based on topic vocabulary density  
   - Provide topic difficulty ratings to set expectations
   - Use adaptive difficulty within topics rather than across them

### If Normalization is Kept

If normalization must be retained:

1. **Make it opt-in, not default**: `ENABLE_DISTRIBUTION_NORMALIZATION=false`
2. **Fix range consistency**: Ensure all methods produce [0,1] outputs
3. **Add proper bounds checking**: Clamp scores to [0,1] after normalization
4. **Document trade-offs clearly**: Let users make informed choices

## Proposed Implementation Fixes

If keeping normalization, fix these issues:

```python
import numpy as np

# After normalization, ensure a consistent [0,1] range for every method
if method == "composite_zscore":
    # Map unbounded z-scores into (0,1) using a sigmoid
    scores = 1 / (1 + np.exp(-normalized_scores))
elif method == "percentile_recentering":
    # Clamp boosted scores to the valid range
    scores = np.clip(boosted_scores, 0, 1)

# Final safety clamp for all methods
composite_scores = np.clip(composite_scores, 0, 1)
```

## Conclusion

The **non-normalized approach respects semantic reality** and provides more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.

**Recommendation**: Disable normalization by default (`ENABLE_DISTRIBUTION_NORMALIZATION=false`) and let the natural semantic relationships guide difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.

The original system's variation across topics was a **feature representing real linguistic diversity**, not a problem requiring artificial correction.