Distribution Normalization Analysis
Overview
Distribution normalization is a feature implemented to ensure consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between normalized and non-normalized approaches and provides recommendations.
The Problem
The original question was: "Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."
Different topics naturally have different semantic similarity ranges:
- "Animals": Rich vocabulary, similarities often range 0.4-0.9
- "Philosophy": Abstract concepts, similarities might range 0.1-0.6
- "Technology": Mixed range, similarities around 0.2-0.8
This led to perceived "inconsistent difficulty" where "Easy Animals" felt easier than "Easy Philosophy" crosswords.
Current Implementation
Composite Score Formula
composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score
With default difficulty_weight = 0.5:
composite = 0.5 * similarity + 0.5 * freq_score
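As a concrete illustration, here is a minimal sketch of this calculation (the function and variable names are illustrative, not the actual implementation):

```python
import numpy as np

def composite_scores(similarities, freq_scores, difficulty_weight=0.5):
    """Blend semantic similarity with a word-frequency score.

    Both inputs are assumed to be arrays of values in [0, 1].
    """
    similarities = np.asarray(similarities, dtype=float)
    freq_scores = np.asarray(freq_scores, dtype=float)
    return (1 - difficulty_weight) * similarities + difficulty_weight * freq_scores
```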
Normalization Methods
- `similarity_range` (default): Normalizes similarities to [0, 1] before the composite calculation
- `composite_zscore`: Z-score normalization (unbounded, typically -3 to +3)
- `percentile_recentering`: Boosts scores based on proximity to a target percentile (can exceed 1.0)
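A rough sketch of what these three methods might look like. The `similarity_range` and `composite_zscore` forms are standard; the `percentile_recentering` boost formula and the `target_percentile` parameter are assumptions for illustration only:

```python
import numpy as np

def normalize(scores, method="similarity_range", target_percentile=50.0):
    """Sketch of the three methods; names match the config values above."""
    scores = np.asarray(scores, dtype=float)
    if method == "similarity_range":
        # Min-max scale into [0, 1]; a degenerate range maps to 0.5
        lo, hi = scores.min(), scores.max()
        return np.full_like(scores, 0.5) if hi == lo else (scores - lo) / (hi - lo)
    if method == "composite_zscore":
        # Standard z-score: unbounded, typically within about [-3, +3]
        std = scores.std()
        return (scores - scores.mean()) / (std if std > 0 else 1.0)
    if method == "percentile_recentering":
        # Assumed form: boost scores by proximity to the target percentile,
        # which is why results can exceed 1.0
        target = np.percentile(scores, target_percentile)
        return scores * (1.0 + np.exp(-np.abs(scores - target)))
    raise ValueError(f"unknown method: {method}")
```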
Configuration
- `ENABLE_DISTRIBUTION_NORMALIZATION=true` (default)
- `NORMALIZATION_METHOD=similarity_range` (default)
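One plausible way these flags are read, assuming they are plain environment variables:

```python
import os

ENABLE_DISTRIBUTION_NORMALIZATION = (
    os.getenv("ENABLE_DISTRIBUTION_NORMALIZATION", "true").lower() == "true"
)
NORMALIZATION_METHOD = os.getenv("NORMALIZATION_METHOD", "similarity_range")
```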
Trade-offs Analysis
Before Normalization (Original System)
Advantages
Natural semantic relationships preserved
- Topics with broader vocabulary naturally had higher similarity ranges
- Reflected genuine linguistic density differences
- Authentic representation of semantic space
Simpler and more predictable
- Straightforward composite score calculation
- Always bounded to [0,1] naturally
- No artificial transformations
Semantic honesty
- Some topics ARE inherently harder to generate crosswords for
- System reflected this reality rather than masking it
- Valuable information for both system and users
Computational efficiency
- No additional normalization calculations
- Cleaner code path
Disadvantages
Inconsistent difficulty across topics
- "Easy" for animals genuinely easier than "Easy" for philosophy
- Could confuse users expecting uniform difficulty
User expectation mismatch
- Players might expect same difficulty label = same challenge level
After Normalization (Current System)
Advantages
Consistent difficulty intent
- Attempts to make "Easy" equally easy across all topics
- Meets user expectations for uniform difficulty labels
Debug visualization enhancements
- Shows normalization effects in debug tab
- Helpful for analysis and understanding
Disadvantages
Artificial stretching of similarity ranges
- Forces sparse topics to use full [0,1] range
- Genuinely dissimilar words appear artificially similar
- Loss of semantic authenticity
Implementation complexity and bugs
- Different methods produce different ranges
- Z-score normalization is unbounded
- Percentile recentering can exceed 1.0
- Softmax sensitivity to inconsistent ranges
Loss of valuable information
- Masks natural vocabulary density differences
- Hides genuine topic difficulty characteristics
- Makes debugging harder (what's "real" vs "normalized"?)
Computational overhead
- Additional calculations for normalization
- More complex code paths
- Potential for numerical issues
Composite Score Ranges
Without Normalization
- Theoretical range: [0, 1]
- Practical range: Depends on actual similarities in the 150-word thematic pool
- Example: If similarities range 0.3-0.7, composite ∈ [0.15, 0.85]
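The bounds in that example follow directly from the formula: with difficulty_weight = 0.5 and freq_score spanning [0, 1], the minimum composite is 0.5 * 0.3 + 0.5 * 0.0 = 0.15 and the maximum is 0.5 * 0.7 + 0.5 * 1.0 = 0.85.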
With Normalization
- `similarity_range`: ~[0, 1] (most consistent)
- `composite_zscore`: Unbounded (typically [-3, +3])
- `percentile_recentering`: Can exceed 1.0 due to boosting
Problems with Current Implementation
- Range inconsistency: Different normalization methods produce different ranges
- Unbounded z-scores: Affect softmax probability calculations unpredictably (see the sketch after this list)
- Values exceeding [0,1]: Break assumptions about composite score bounds
- Complexity without clear benefit: Added complexity for questionable gains
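A quick illustration of the z-score problem: feeding the same relative ordering into softmax at different scales produces very different selection probabilities (toy numbers, not taken from the actual system):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

bounded = np.array([0.2, 0.5, 0.8])    # similarity_range-style output
zscores = np.array([-1.2, 0.0, 1.2])   # same ordering on a z-score scale

print(softmax(bounded))  # ~[0.24, 0.32, 0.44] -- gentle preference
print(softmax(zscores))  # ~[0.07, 0.22, 0.72] -- much sharper preference
```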
Recommendation
Revert to Non-Normalized Approach
The original system was better for these reasons:
The "problem" wasn't really a problem
- Different topics having different difficulty distributions is natural and informative
- Philosophy IS harder to make crosswords for than animals - this is linguistic reality
Normalization introduces distortions
- Stretching narrow ranges doesn't make words more semantically similar
- Creates artificial relationships that don't exist
Alternative solutions are better:
- Show users the natural difficulty of each topic
- Adjust word count based on topic vocabulary density (sketched after this list)
- Provide topic difficulty ratings to set expectations
- Use adaptive difficulty within topics rather than across them
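For example, the word-count adjustment might look like the following entirely hypothetical heuristic; the density thresholds and counts are made up for illustration:

```python
import numpy as np

def pool_word_count(similarities, base_count=15):
    """Hypothetical heuristic: shrink the puzzle for sparse topics.

    `similarities` are the raw (non-normalized) similarities of the
    thematic pool; a low median signals sparse vocabulary.
    """
    density = float(np.median(similarities))
    if density < 0.3:   # sparse topic (e.g. abstract philosophy terms)
        return max(8, base_count - 5)
    if density > 0.6:   # dense topic (e.g. animals)
        return base_count + 5
    return base_count
```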
If Normalization is Kept
If normalization must be retained:
- Make it opt-in, not default: `ENABLE_DISTRIBUTION_NORMALIZATION=false`
- Fix range consistency: Ensure all methods produce [0, 1] outputs
- Add proper bounds checking: Clamp scores to [0, 1] after normalization
- Document trade-offs clearly: Let users make informed choices
Proposed Implementation Fixes
If keeping normalization, fix these issues:
```python
import numpy as np

# After normalization, ensure a consistent [0, 1] range
if method == "composite_zscore":
    # Map unbounded z-scores into [0, 1] with a sigmoid
    composite_scores = 1 / (1 + np.exp(-normalized_scores))
elif method == "percentile_recentering":
    # Clamp boosted scores (which can exceed 1.0) to the valid range
    composite_scores = np.clip(boosted_scores, 0, 1)

# Final safety clamp for all methods
composite_scores = np.clip(composite_scores, 0, 1)
```
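A note on the sigmoid choice: it maps z-scores of ±3 to roughly 0.05 and 0.95, so the typical z-score range fills most of [0, 1] smoothly instead of being hard-clipped at the edges.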
Conclusion
The non-normalized approach respects semantic reality and provides more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.
Recommendation: Disable normalization by default (ENABLE_DISTRIBUTION_NORMALIZATION=false) and let the natural semantic relationships guide difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.
The original system's variation across topics was a feature representing real linguistic diversity, not a problem requiring artificial correction.