Distribution Normalization Analysis

Overview

Distribution normalization is a feature implemented to ensure consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between normalized and non-normalized approaches and provides recommendations.

The Problem

The original question was: "Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."

Different topics naturally have different semantic similarity ranges:

  • "Animals": Rich vocabulary, similarities often range 0.4-0.9
  • "Philosophy": Abstract concepts, similarities might range 0.1-0.6
  • "Technology": Mixed range, similarities around 0.2-0.8

This led to perceived "inconsistent difficulty" where "Easy Animals" felt easier than "Easy Philosophy" crosswords.

Current Implementation

Composite Score Formula

composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score

With default difficulty_weight = 0.5:

composite = 0.5 * similarity + 0.5 * freq_score
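
A minimal sketch of this calculation, assuming similarity and freq_score arrive as numpy arrays already scaled to [0, 1] (the function name and signature are illustrative, not the actual backend API):

import numpy as np

def composite_score(similarity: np.ndarray,
                    freq_score: np.ndarray,
                    difficulty_weight: float = 0.5) -> np.ndarray:
    # Linear blend of semantic similarity and word-frequency score;
    # the result stays in [0, 1] as long as both inputs do
    return (1 - difficulty_weight) * similarity + difficulty_weight * freq_score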

Normalization Methods

  1. similarity_range (default): Normalizes similarities to [0,1] before composite calculation
  2. composite_zscore: Z-score normalization (unbounded, typically -3 to +3)
  3. percentile_recentering: Boosts scores based on proximity to target percentile (can exceed 1.0)
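
Hedged sketches of what these methods might look like (function names, edge-case handling, and the exact boosting formula are assumptions for illustration, not the backend's actual code):

import numpy as np

def similarity_range(sims: np.ndarray) -> np.ndarray:
    # Min-max rescale to [0, 1]; a degenerate pool maps to a constant 0.5
    lo, hi = sims.min(), sims.max()
    if hi - lo < 1e-9:
        return np.full_like(sims, 0.5)
    return (sims - lo) / (hi - lo)

def composite_zscore(scores: np.ndarray) -> np.ndarray:
    # Standardize to mean 0, std 1; note the output is unbounded
    std = scores.std()
    return np.zeros_like(scores) if std < 1e-9 else (scores - scores.mean()) / std

def percentile_recentering(scores: np.ndarray, target_pct: float = 50.0) -> np.ndarray:
    # Boost scores near a target percentile; the multiplicative boost
    # can push results above 1.0 (the range problem discussed later)
    target = np.percentile(scores, target_pct)
    boost = 1.0 - np.abs(scores - target)
    return scores * (1.0 + 0.5 * boost)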

Configuration

  • ENABLE_DISTRIBUTION_NORMALIZATION=true (default)
  • NORMALIZATION_METHOD=similarity_range (default)
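
One plausible way the backend could read these flags (the os.getenv pattern is an assumption; only the variable names and defaults come from this document):

import os

ENABLE_NORMALIZATION = (
    os.getenv("ENABLE_DISTRIBUTION_NORMALIZATION", "true").lower() == "true")
NORMALIZATION_METHOD = os.getenv("NORMALIZATION_METHOD", "similarity_range")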

Trade-offs Analysis

Before Normalization (Original System)

Advantages βœ…

  1. Natural semantic relationships preserved

    • Topics with broader vocabulary naturally had higher similarity ranges
    • Reflected genuine linguistic density differences
    • Authentic representation of semantic space
  2. Simpler and more predictable

    • Straightforward composite score calculation
    • Always bounded to [0,1] naturally
    • No artificial transformations
  3. Semantic honesty

    • Some topics ARE inherently harder to generate crosswords for
    • System reflected this reality rather than masking it
    • Valuable information for both system and users
  4. Computational efficiency

    • No additional normalization calculations
    • Cleaner code path

Disadvantages ❌

  1. Inconsistent difficulty across topics

    • "Easy" for animals genuinely easier than "Easy" for philosophy
    • Could confuse users expecting uniform difficulty
  2. User expectation mismatch

    • Players might expect the same difficulty label to mean the same challenge level

After Normalization (Current System)

Advantages βœ…

  1. Consistent difficulty intent

    • Attempts to make "Easy" equally easy across all topics
    • Meets user expectations for uniform difficulty labels
  2. Debug visualization enhancements

    • Shows normalization effects in debug tab
    • Helpful for analysis and understanding

Disadvantages ❌

  1. Artificial stretching of similarity ranges

    • Forces sparse topics to use the full [0,1] range
    • Genuinely dissimilar words appear artificially similar
    • Loss of semantic authenticity
  2. Implementation complexity and bugs

    • Different methods produce different ranges
    • Z-score normalization is unbounded
    • Percentile recentering can exceed 1.0
    • Softmax sensitivity to inconsistent ranges
  3. Loss of valuable information

    • Masks natural vocabulary density differences
    • Hides genuine topic difficulty characteristics
    • Makes debugging harder (what's "real" vs "normalized"?)
  4. Computational overhead

    • Additional calculations for normalization
    • More complex code paths
    • Potential for numerical issues

Composite Score Ranges

Without Normalization

  • Theoretical range: [0, 1]
  • Practical range: Depends on actual similarities in the 150-word thematic pool
  • Example: If similarities range 0.3-0.7 and freq_score spans [0, 1], composite ≈ [0.15, 0.85]
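
A quick check of that arithmetic, assuming freq_score spans its full [0, 1] range:

import numpy as np

sims = np.array([0.3, 0.7])      # extremes of the observed similarity range
freqs = np.array([0.0, 1.0])     # freq_score at its theoretical extremes
print(0.5 * sims + 0.5 * freqs)  # [0.15 0.85] -- the practical range above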

With Normalization

  • similarity_range: ~[0, 1] (most consistent)
  • composite_zscore: Unbounded (typically [-3, +3])
  • percentile_recentering: Can exceed 1.0 due to boosting

Problems with Current Implementation

  1. Range inconsistency: Different normalization methods produce different ranges
  2. Unbounded z-scores: Affect softmax probability calculations unpredictably (see the demonstration after this list)
  3. Values exceeding [0,1]: Break assumptions about composite score bounds
  4. Complexity without clear benefit: Added complexity for questionable gains
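
A small demonstration of the z-score issue, assuming candidate words are sampled via a softmax over composite scores (the sampling mechanism is an assumption here; the numbers are illustrative):

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

bounded = np.array([0.2, 0.5, 0.8])    # composite scores kept in [0, 1]
z_scores = np.array([-2.0, 0.0, 2.0])  # same ranking after z-scoring

print(softmax(bounded))   # ~[0.24, 0.32, 0.44] -- gentle preference
print(softmax(z_scores))  # ~[0.02, 0.12, 0.87] -- nearly winner-take-all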

Recommendation

Revert to Non-Normalized Approach

The original system was better for these reasons:

  1. The "problem" wasn't really a problem

    • Different topics having different difficulty distributions is natural and informative
    • Philosophy IS harder to make crosswords for than animals; this is linguistic reality
  2. Normalization introduces distortions

    • Stretching narrow ranges doesn't make words more semantically similar
    • Creates artificial relationships that don't exist
  3. Alternative solutions are better:

    • Show users the natural difficulty of each topic
    • Adjust word count based on topic vocabulary density (sketched after this list)
    • Provide topic difficulty ratings to set expectations
    • Use adaptive difficulty within topics rather than across them
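
A hedged sketch of the word-count idea, where a topic with a narrow similarity spread gets a smaller puzzle (the thresholds and scaling factors are invented for illustration):

import numpy as np

def suggested_word_count(similarities: np.ndarray, base_count: int = 15) -> int:
    # Narrower similarity spread -> sparser vocabulary -> smaller puzzle
    spread = float(similarities.max() - similarities.min())
    scale = np.clip(spread / 0.5, 0.5, 1.0)  # a 0.5 spread counts as "rich"
    return max(5, int(base_count * scale))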

If Normalization is Kept

If normalization must be retained:

  1. Make it opt-in, not default: ENABLE_DISTRIBUTION_NORMALIZATION=false
  2. Fix range consistency: Ensure all methods produce [0,1] outputs
  3. Add proper bounds checking: Clamp scores to [0,1] after normalization
  4. Document trade-offs clearly: Let users make informed choices

Proposed Implementation Fixes

If keeping normalization, fix these issues:

import numpy as np

# After normalization, ensure a consistent [0, 1] range
if method == "composite_zscore":
    # Map unbounded z-scores into (0, 1) with a sigmoid
    scores = 1 / (1 + np.exp(-normalized_scores))
elif method == "percentile_recentering":
    # Clamp boosted scores back into the valid range
    scores = np.clip(boosted_scores, 0.0, 1.0)

# Final safety clamp for all methods
composite_scores = np.clip(composite_scores, 0.0, 1.0)

Conclusion

The non-normalized approach respects semantic reality and provides more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.

Recommendation: Disable normalization by default (ENABLE_DISTRIBUTION_NORMALIZATION=false) and let the natural semantic relationships guide difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.

The original system's variation across topics was a feature representing real linguistic diversity, not a problem requiring artificial correction.