Distribution Normalization Analysis

Overview

Distribution normalization is a feature implemented to ensure consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between normalized and non-normalized approaches and provides recommendations.

The Problem

The original question was: "Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."

Different topics naturally have different semantic similarity ranges:

  • "Animals": Rich vocabulary, similarities often range 0.4-0.9
  • "Philosophy": Abstract concepts, similarities might range 0.1-0.6
  • "Technology": Mixed range, similarities around 0.2-0.8

This led to perceived "inconsistent difficulty" where "Easy Animals" felt easier than "Easy Philosophy" crosswords.

Current Implementation

Composite Score Formula

composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score

With default difficulty_weight = 0.5:

composite = 0.5 * similarity + 0.5 * freq_score
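
A minimal sketch of this calculation, assuming similarity and freq_score arrive as numpy arrays already scaled to [0, 1] (the function name and signature are illustrative, not the actual backend API):

import numpy as np

def composite_score(similarity: np.ndarray,
                    freq_score: np.ndarray,
                    difficulty_weight: float = 0.5) -> np.ndarray:
    # Linear blend of semantic similarity and word-frequency score;
    # the result stays in [0, 1] as long as both inputs do
    return (1 - difficulty_weight) * similarity + difficulty_weight * freq_score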

Normalization Methods

  1. similarity_range (default): Normalizes similarities to [0,1] before composite calculation
  2. composite_zscore: Z-score normalization (unbounded, typically -3 to +3)
  3. percentile_recentering: Boosts scores based on proximity to target percentile (can exceed 1.0)
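
Hedged sketches of what these methods might look like (function names, edge-case handling, and the exact boosting formula are assumptions for illustration, not the backend's actual code):

import numpy as np

def similarity_range(sims: np.ndarray) -> np.ndarray:
    # Min-max rescale to [0, 1]; a degenerate pool maps to a constant 0.5
    lo, hi = sims.min(), sims.max()
    if hi - lo < 1e-9:
        return np.full_like(sims, 0.5)
    return (sims - lo) / (hi - lo)

def composite_zscore(scores: np.ndarray) -> np.ndarray:
    # Standardize to mean 0, std 1; note the output is unbounded
    std = scores.std()
    return np.zeros_like(scores) if std < 1e-9 else (scores - scores.mean()) / std

def percentile_recentering(scores: np.ndarray, target_pct: float = 50.0) -> np.ndarray:
    # Boost scores near a target percentile; the multiplicative boost
    # can push results above 1.0 (the range problem discussed later)
    target = np.percentile(scores, target_pct)
    boost = 1.0 - np.abs(scores - target)
    return scores * (1.0 + 0.5 * boost)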

Configuration

  • ENABLE_DISTRIBUTION_NORMALIZATION=true (default)
  • NORMALIZATION_METHOD=similarity_range (default)
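
One plausible way the backend could read these flags (the os.getenv pattern is an assumption; only the variable names and defaults come from this document):

import os

ENABLE_NORMALIZATION = (
    os.getenv("ENABLE_DISTRIBUTION_NORMALIZATION", "true").lower() == "true")
NORMALIZATION_METHOD = os.getenv("NORMALIZATION_METHOD", "similarity_range")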

Trade-offs Analysis

Before Normalization (Original System)

Advantages βœ…

  1. Natural semantic relationships preserved

    • Topics with broader vocabulary naturally had higher similarity ranges
    • Reflected genuine linguistic density differences
    • Authentic representation of semantic space
  2. Simpler and more predictable

    • Straightforward composite score calculation
    • Always bounded to [0,1] naturally
    • No artificial transformations
  3. Semantic honesty

    • Some topics ARE inherently harder to generate crosswords for
    • System reflected this reality rather than masking it
    • Valuable information for both system and users
  4. Computational efficiency

    • No additional normalization calculations
    • Cleaner code path

Disadvantages ❌

  1. Inconsistent difficulty across topics

    • "Easy" for animals genuinely easier than "Easy" for philosophy
    • Could confuse users expecting uniform difficulty
  2. User expectation mismatch

    • Players might expect the same difficulty label to mean the same challenge level

After Normalization (Current System)

Advantages βœ…

  1. Consistent difficulty intent

    • Attempts to make "Easy" equally easy across all topics
    • Meets user expectations for uniform difficulty labels
  2. Debug visualization enhancements

    • Shows normalization effects in debug tab
    • Helpful for analysis and understanding

Disadvantages ❌

  1. Artificial stretching of similarity ranges

    • Forces sparse topics to use the full [0,1] range
    • Genuinely dissimilar words appear artificially similar
    • Loss of semantic authenticity
  2. Implementation complexity and bugs

    • Different methods produce different ranges
    • Z-score normalization is unbounded
    • Percentile recentering can exceed 1.0
    • Softmax sensitivity to inconsistent ranges
  3. Loss of valuable information

    • Masks natural vocabulary density differences
    • Hides genuine topic difficulty characteristics
    • Makes debugging harder (what's "real" vs "normalized"?)
  4. Computational overhead

    • Additional calculations for normalization
    • More complex code paths
    • Potential for numerical issues

Composite Score Ranges

Without Normalization

  • Theoretical range: [0, 1]
  • Practical range: Depends on actual similarities in the 150-word thematic pool
  • Example: If similarities range 0.3-0.7 and freq_score spans [0, 1], composite ≈ [0.15, 0.85]
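
A quick check of that arithmetic, assuming freq_score spans its full [0, 1] range:

import numpy as np

sims = np.array([0.3, 0.7])      # extremes of the observed similarity range
freqs = np.array([0.0, 1.0])     # freq_score at its theoretical extremes
print(0.5 * sims + 0.5 * freqs)  # [0.15 0.85] -- the practical range above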

With Normalization

  • similarity_range: ~[0, 1] (most consistent)
  • composite_zscore: Unbounded (typically [-3, +3])
  • percentile_recentering: Can exceed 1.0 due to boosting

Problems with Current Implementation

  1. Range inconsistency: Different normalization methods produce different ranges
  2. Unbounded z-scores: Affect softmax probability calculations unpredictably (see the demonstration after this list)
  3. Values exceeding [0,1]: Break assumptions about composite score bounds
  4. Complexity without clear benefit: Added complexity for questionable gains
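
A small demonstration of the z-score issue, assuming candidate words are sampled via a softmax over composite scores (the sampling mechanism is an assumption here; the numbers are illustrative):

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

bounded = np.array([0.2, 0.5, 0.8])    # composite scores kept in [0, 1]
z_scores = np.array([-2.0, 0.0, 2.0])  # same ranking after z-scoring

print(softmax(bounded))   # ~[0.24, 0.32, 0.44] -- gentle preference
print(softmax(z_scores))  # ~[0.02, 0.12, 0.87] -- nearly winner-take-all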

Recommendation

Revert to Non-Normalized Approach

The original system was better for these reasons:

  1. The "problem" wasn't really a problem

    • Different topics having different difficulty distributions is natural and informative
    • Philosophy IS harder to make crosswords for than animals; this is linguistic reality
  2. Normalization introduces distortions

    • Stretching narrow ranges doesn't make words more semantically similar
    • Creates artificial relationships that don't exist
  3. Alternative solutions are better:

    • Show users the natural difficulty of each topic
    • Adjust word count based on topic vocabulary density (sketched after this list)
    • Provide topic difficulty ratings to set expectations
    • Use adaptive difficulty within topics rather than across them
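
A hedged sketch of the word-count idea, where a topic with a narrow similarity spread gets a smaller puzzle (the thresholds and scaling factors are invented for illustration):

import numpy as np

def suggested_word_count(similarities: np.ndarray, base_count: int = 15) -> int:
    # Narrower similarity spread -> sparser vocabulary -> smaller puzzle
    spread = float(similarities.max() - similarities.min())
    scale = np.clip(spread / 0.5, 0.5, 1.0)  # a 0.5 spread counts as "rich"
    return max(5, int(base_count * scale))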

If Normalization is Kept

If normalization must be retained:

  1. Make it opt-in, not default: ENABLE_DISTRIBUTION_NORMALIZATION=false
  2. Fix range consistency: Ensure all methods produce [0,1] outputs
  3. Add proper bounds checking: Clamp scores to [0,1] after normalization
  4. Document trade-offs clearly: Let users make informed choices

Proposed Implementation Fixes

If keeping normalization, fix these issues:

import numpy as np

# After normalization, ensure a consistent [0, 1] range
if method == "composite_zscore":
    # Map unbounded z-scores into (0, 1) with a sigmoid
    scores = 1 / (1 + np.exp(-normalized_scores))
elif method == "percentile_recentering":
    # Clamp boosted scores back into the valid range
    scores = np.clip(boosted_scores, 0.0, 1.0)

# Final safety clamp for all methods
composite_scores = np.clip(composite_scores, 0.0, 1.0)

Conclusion

The non-normalized approach respects semantic reality and provides more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.

Recommendation: Disable normalization by default (ENABLE_DISTRIBUTION_NORMALIZATION=false) and let the natural semantic relationships guide difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.

The original system's variation across topics was a feature representing real linguistic diversity, not a problem requiring artificial correction.