# Distribution Normalization Analysis

## Overview

Distribution normalization is a feature implemented to ensure consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between the normalized and non-normalized approaches and provides recommendations.

## The Problem

The original question was: *"Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."*

Different topics naturally have different semantic similarity ranges:

- **"Animals"**: Rich vocabulary; similarities often range 0.4-0.9
- **"Philosophy"**: Abstract concepts; similarities might range 0.1-0.6
- **"Technology"**: Mixed range; similarities around 0.2-0.8

This led to perceived inconsistent difficulty, where "Easy Animals" crosswords felt easier than "Easy Philosophy" ones.

## Current Implementation

### Composite Score Formula

```
composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score
```

With the default `difficulty_weight = 0.5`:

```
composite = 0.5 * similarity + 0.5 * freq_score
```

### Normalization Methods

1. **`similarity_range` (default)**: Min-max normalizes similarities to [0, 1] before the composite calculation
2. **`composite_zscore`**: Z-score normalization (unbounded, typically -3 to +3)
3. **`percentile_recentering`**: Boosts scores based on proximity to a target percentile (can exceed 1.0)

### Configuration

- `ENABLE_DISTRIBUTION_NORMALIZATION=true` (default)
- `NORMALIZATION_METHOD=similarity_range` (default)
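To make the formula and the default method concrete, here is a minimal sketch of the composite calculation with `similarity_range` normalization. The function and variable names are illustrative, not the actual codebase API:

```python
import numpy as np

def composite_scores(similarities, freq_scores, difficulty_weight=0.5,
                     normalize=True):
    """Blend semantic similarity with word-frequency scores.

    With normalize=True, similarities are min-max rescaled to [0, 1]
    before blending, mirroring the default `similarity_range` method.
    """
    sims = np.asarray(similarities, dtype=float)
    freqs = np.asarray(freq_scores, dtype=float)
    if normalize:
        span = sims.max() - sims.min()
        if span > 0:  # guard against a constant-similarity pool
            sims = (sims - sims.min()) / span
    return (1 - difficulty_weight) * sims + difficulty_weight * freqs

# A sparse "philosophy-like" pool: raw similarities span only 0.1-0.6,
# but normalization stretches them across the full [0, 1] range.
sims = [0.10, 0.25, 0.40, 0.60]
freqs = [0.50, 0.50, 0.50, 0.50]
print(composite_scores(sims, freqs, normalize=False))  # [0.3  0.375 0.45 0.55]
print(composite_scores(sims, freqs, normalize=True))   # [0.25 0.4   0.55 0.75]
```

Note how the same raw similarities produce a visibly wider composite spread once normalized; this is exactly the "artificial stretching" discussed below.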
## Trade-offs Analysis

### Before Normalization (Original System)

#### Advantages ✅

1. **Natural semantic relationships preserved**
   - Topics with broader vocabularies naturally had higher similarity ranges
   - Reflected genuine differences in linguistic density
   - Gave an authentic representation of the semantic space
2. **Simpler and more predictable**
   - Straightforward composite score calculation
   - Naturally bounded to [0, 1]
   - No artificial transformations
3. **Semantic honesty**
   - Some topics ARE inherently harder to generate crosswords for
   - The system reflected this reality rather than masking it
   - Valuable information for both the system and users
4. **Computational efficiency**
   - No additional normalization calculations
   - Cleaner code path

#### Disadvantages ❌

1. **Inconsistent difficulty across topics**
   - "Easy" for animals was genuinely easier than "Easy" for philosophy
   - Could confuse users expecting uniform difficulty
2. **User expectation mismatch**
   - Players might expect the same difficulty label to mean the same challenge level

### After Normalization (Current System)

#### Advantages ✅

1. **Consistent difficulty intent**
   - Attempts to make "Easy" equally easy across all topics
   - Meets user expectations for uniform difficulty labels
2. **Debug visualization enhancements**
   - Shows normalization effects in the debug tab
   - Helpful for analysis and understanding

#### Disadvantages ❌

1. **Artificial stretching of similarity ranges**
   - Forces sparse topics to use the full [0, 1] range
   - Genuinely dissimilar words appear artificially similar
   - Loss of semantic authenticity
2. **Implementation complexity and bugs**
   - Different methods produce different output ranges
   - Z-score normalization is unbounded
   - Percentile recentering can exceed 1.0
   - Softmax is sensitive to the inconsistent ranges
3. **Loss of valuable information**
   - Masks natural differences in vocabulary density
   - Hides genuine topic difficulty characteristics
   - Makes debugging harder (what's "real" vs. "normalized"?)
4. **Computational overhead**
   - Additional calculations for normalization
   - More complex code paths
   - Potential for numerical issues

## Composite Score Ranges

### Without Normalization

- **Theoretical range**: [0, 1]
- **Practical range**: Depends on the actual similarities in the 150-word thematic pool
- **Example**: If similarities range 0.3-0.7 (and `freq_score` spans the full [0, 1]), composite ≈ [0.15, 0.85]

### With Normalization

- **`similarity_range`**: ~[0, 1] (most consistent)
- **`composite_zscore`**: Unbounded (typically [-3, +3])
- **`percentile_recentering`**: Can exceed 1.0 due to boosting

## Problems with Current Implementation

1. **Range inconsistency**: Different normalization methods produce different ranges
2. **Unbounded z-scores**: Affect softmax probability calculations unpredictably
3. **Values exceeding [0, 1]**: Break assumptions about composite score bounds
4. **Complexity without clear benefit**: Added complexity for questionable gains

## Recommendation

### **Revert to Non-Normalized Approach**

The original system was **better** for these reasons:

1. **The "problem" wasn't really a problem**
   - Different topics having different difficulty distributions is natural and informative
   - Philosophy IS harder to make crosswords for than animals; this is linguistic reality
2. **Normalization introduces distortions**
   - Stretching a narrow range doesn't make the words more semantically similar
   - Creates artificial relationships that don't exist
3. **Alternative solutions are better**
   - Show users the natural difficulty of each topic
   - Adjust word count based on topic vocabulary density
   - Provide topic difficulty ratings to set expectations
   - Use adaptive difficulty within topics rather than across them

### If Normalization is Kept

If normalization must be retained:

1. **Make it opt-in, not the default**: `ENABLE_DISTRIBUTION_NORMALIZATION=false`
2. **Fix range consistency**: Ensure all methods produce [0, 1] outputs
3. **Add proper bounds checking**: Clamp scores to [0, 1] after normalization
4. **Document trade-offs clearly**: Let users make informed choices

## Proposed Implementation Fixes

If keeping normalization, the range issues can be fixed by mapping every method's output back to a consistent [0, 1] range:

```python
import numpy as np

def enforce_unit_range(scores: np.ndarray, method: str) -> np.ndarray:
    """Map normalized scores back to a consistent [0, 1] range."""
    if method == "composite_zscore":
        # Z-scores are unbounded; squash them into (0, 1) with a sigmoid
        scores = 1 / (1 + np.exp(-scores))
    elif method == "percentile_recentering":
        # Percentile boosting can exceed 1.0; clamp to the valid range
        scores = np.clip(scores, 0, 1)
    # Final safety clamp for all methods
    return np.clip(scores, 0, 1)
```

## Conclusion

The **non-normalized approach respects semantic reality** and produces more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.

**Recommendation**: Disable normalization by default (`ENABLE_DISTRIBUTION_NORMALIZATION=false`) and let the natural semantic relationships guide the difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.

The original system's variation across topics was a **feature representing real linguistic diversity**, not a problem requiring artificial correction.
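For reference, the recommended defaults under this proposal, using the existing environment variables:

```
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range  # presumably ignored while normalization is disabled
```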